deployment
11 concepts in this domain
-
API Gateway
System: A server that acts as an intermediary between clients and backend services, handling cross-cutting concerns like authentication, rate limiting, logging, and routing. For LLM applications, API gateways...
Also: Gateway, LLM Gateway, AI Gateway
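A minimal gateway sketch, assuming FastAPI and httpx; the backend URL, API key store, and per-key limits are illustrative placeholders rather than any particular product's API.

```python
import time
from collections import defaultdict

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
BACKEND_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical upstream model server
API_KEYS = {"demo-key"}                                    # hypothetical key store
_recent: dict[str, list[float]] = defaultdict(list)        # per-key request timestamps

def within_rate_limit(key: str, limit: int = 60, window: float = 60.0) -> bool:
    now = time.monotonic()
    _recent[key] = [t for t in _recent[key] if now - t < window]
    if len(_recent[key]) >= limit:
        return False
    _recent[key].append(now)
    return True

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    key = request.headers.get("x-api-key", "")
    if key not in API_KEYS:                     # authentication
        raise HTTPException(status_code=401, detail="invalid API key")
    if not within_rate_limit(key):              # rate limiting
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    body = await request.json()
    async with httpx.AsyncClient() as client:   # routing: forward to the backend service
        upstream = await client.post(BACKEND_URL, json=body, timeout=120.0)
    return upstream.json()
```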
-
Batch Inference
Process: Processing many inference requests together as a batch rather than one at a time. Batch inference optimizes for throughput and cost rather than latency—it's appropriate when you have many prompts to ...
Also: Batch Processing, Offline Inference, Bulk Inference
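A minimal batching sketch; `run_model` is a hypothetical stand-in for a real batched inference call (a single forward pass or a provider batch job), which is where the throughput and cost savings actually come from.

```python
from typing import Iterator

def chunked(prompts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size batches of prompts."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]

def run_model(batch: list[str]) -> list[str]:
    # Placeholder: a real implementation processes the whole batch at once,
    # amortizing model and per-request overhead across many prompts.
    return [f"response to: {p}" for p in batch]

def batch_inference(prompts: list[str], batch_size: int = 32) -> list[str]:
    results: list[str] = []
    for batch in chunked(prompts, batch_size):
        results.extend(run_model(batch))
    return results

if __name__ == "__main__":
    print(batch_inference(["summarize report A", "summarize report B"], batch_size=2))
```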
-
Caching
Process: Storing and reusing LLM responses for identical or similar requests to reduce latency and cost. Caching is particularly valuable for LLMs because inference is expensive and deterministic enough that r...
Also: Response Caching, Prompt Caching, LLM Caching
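A minimal exact-match cache sketch keyed on a hash of the prompt and generation parameters; `call_model` is a hypothetical placeholder for the real inference call, and similarity-based (semantic) caching would need an embedding lookup instead.

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(prompt: str, params: dict) -> str:
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_model(prompt: str, params: dict) -> str:
    return f"(placeholder) completion for: {prompt}"  # stand-in for the expensive call

def cached_completion(prompt: str, params: dict) -> str:
    key = cache_key(prompt, params)
    if key not in _cache:          # miss: pay for inference once
        _cache[key] = call_model(prompt, params)
    return _cache[key]             # hit: reuse the stored response

cached_completion("Summarize this policy.", {"temperature": 0})  # miss
cached_completion("Summarize this policy.", {"temperature": 0})  # hit, no model call
```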
-
Edge Deployment
Process: Running AI models on devices close to the user—phones, laptops, edge servers—rather than in centralized cloud data centers. Edge deployment reduces latency, enables offline operation, and keeps data ...
Also: On-Device AI, Edge AI, Local Inference
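A minimal on-device inference sketch, assuming the llama-cpp-python package and a locally stored GGUF model; the file path and parameters are illustrative.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/small-model.gguf",  # hypothetical local model file
    n_ctx=2048,                              # small context window to fit edge hardware
    n_threads=4,                             # match the device's CPU cores
)

# Inference runs entirely on the local machine: no network round trip,
# it works offline, and the prompt never leaves the device.
result = llm("Explain edge deployment in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```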
-
Inference Endpoint
Interface: An API endpoint that accepts inference requests and returns model predictions. Inference endpoints abstract away the complexity of model serving—clients send requests (prompts) and receive responses (...
Also: API Endpoint, Model Endpoint, Prediction Endpoint
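A minimal client sketch against an OpenAI-compatible chat completions endpoint; the URL, key, and model name are placeholders, and real endpoints differ in schema details.

```python
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"   # hypothetical endpoint URL

response = requests.post(
    ENDPOINT,
    headers={"Authorization": "Bearer sk-placeholder"},   # hypothetical credential
    json={
        "model": "my-deployed-model",                     # hypothetical model id
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```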
-
Load Balancing
Process: Distributing incoming requests across multiple servers or model replicas to optimize resource utilization, maximize throughput, and ensure high availability. For LLM serving, load balancing distribute...
Also: Traffic Distribution, Request Distribution
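A minimal selection sketch showing two common policies, round-robin and least-outstanding-requests, over a static replica list; the URLs are illustrative and the dispatch call is a placeholder.

```python
import itertools
from collections import Counter

REPLICAS = ["http://replica-1:8000", "http://replica-2:8000", "http://replica-3:8000"]

# Round-robin: cycle through replicas in a fixed order.
_rr = itertools.cycle(REPLICAS)
def pick_round_robin() -> str:
    return next(_rr)

# Least outstanding requests: prefer the replica with the fewest in-flight calls,
# which matters for LLMs because request durations vary widely.
_in_flight: Counter = Counter()
def pick_least_loaded() -> str:
    return min(REPLICAS, key=lambda r: _in_flight[r])

def dispatch(prompt: str) -> str:
    replica = pick_least_loaded()
    _in_flight[replica] += 1
    try:
        return f"(placeholder) would send {prompt!r} to {replica}"
    finally:
        _in_flight[replica] -= 1
```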
-
Model Router
System: A system that dynamically selects which model to use for a given request based on criteria like cost, latency, capability, or query complexity. Routers enable cost optimization (use cheaper models for...
Also: LLM Router, Intelligent Routing, Model Selection
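A minimal routing sketch: a crude heuristic sends short, simple queries to a cheaper model and everything else to a stronger one. The model names and the complexity rule are illustrative assumptions, not a recommended policy.

```python
def estimate_complexity(prompt: str) -> float:
    # Toy heuristic: longer prompts and reasoning-style keywords score higher.
    score = min(len(prompt) / 2000, 1.0)
    if any(k in prompt.lower() for k in ("prove", "analyze", "step by step")):
        score += 0.5
    return score

def route(prompt: str) -> str:
    return "small-cheap-model" if estimate_complexity(prompt) < 0.5 else "large-capable-model"

print(route("What's the capital of France?"))               # -> small-cheap-model
print(route("Analyze this contract clause step by step."))  # -> large-capable-model
```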
-
Model Serving
System: The infrastructure and systems that make trained models available for inference requests. Model serving handles loading models into memory, processing requests, managing resources, and returning predi...
Also: Model Deployment, Inference Serving, Model Hosting
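A minimal serving sketch, assuming FastAPI; `load_model` and `Model.generate` are hypothetical placeholders for a real backend (vLLM, TGI, Triton, and similar systems fill this role in practice). The point it illustrates is loading the model once at startup and reusing it across requests.

```python
from fastapi import FastAPI
from pydantic import BaseModel

class Model:
    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"(placeholder) completion for: {prompt}"

def load_model() -> Model:
    # In a real server this loads weights into GPU/CPU memory once, at startup,
    # so every request reuses the same in-memory model.
    return Model()

app = FastAPI()
model = load_model()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
def generate(req: CompletionRequest) -> dict:
    return {"completion": model.generate(req.prompt, req.max_tokens)}
```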
-
Rate Limiting
Process: Restricting the number of requests a client can make to an API within a time window. Rate limiting protects services from abuse, ensures fair resource allocation, and maintains system stability. For L...
Also: Throttling, Request Limits, Quota Management
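A minimal token-bucket sketch that budgets tokens per minute rather than raw request counts, since LLM cost and load scale with token usage; the capacity and refill numbers are illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float = 10_000             # maximum tokens that can accumulate
    refill_per_sec: float = 10_000 / 60  # roughly 10k tokens per minute
    tokens: float = 10_000.0
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self, requested_tokens: int) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if requested_tokens <= self.tokens:
            self.tokens -= requested_tokens
            return True
        return False                     # caller would typically respond with HTTP 429

bucket = TokenBucket()
print(bucket.allow(2_000))   # True: within budget
print(bucket.allow(50_000))  # False: exceeds the remaining budget
```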
-
Sandbox
Runtime: An isolated execution environment where code can run without affecting the host system or accessing sensitive resources. In AI systems, sandboxes are used to safely execute AI-generated code, test u...
Also: Sandboxed Environment, Isolated Environment, Safe Execution Environment
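A minimal sketch of the interface a sandbox exposes: run untrusted code with a time limit and capture its output. A bare subprocess is not real isolation; production sandboxes add containers, VMs, or seccomp/gVisor-style confinement on top.

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_sec: float = 5.0) -> dict:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env vars and user site-packages
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "exit_code": -1}
    finally:
        os.unlink(path)

print(run_untrusted("print(sum(range(10)))"))
```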
-
Streaming
Process: Delivering LLM output incrementally as tokens are generated rather than waiting for the complete response. Streaming improves perceived latency—users see text appearing progressively instead of waitin...
Also: Token Streaming, Server-Sent Events, SSE
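A minimal server-side streaming sketch using FastAPI's StreamingResponse to emit tokens as Server-Sent Events; the token generator is a placeholder for a real model's incremental output.

```python
import time

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def fake_token_stream(prompt: str):
    # Placeholder: a real server yields tokens as the model generates them.
    for token in ["Streaming ", "sends ", "tokens ", "as ", "they ", "arrive."]:
        time.sleep(0.05)
        yield f"data: {token}\n\n"   # SSE framing: each event is a "data:" line plus a blank line
    yield "data: [DONE]\n\n"

@app.get("/stream")
def stream(prompt: str = ""):
    return StreamingResponse(fake_token_stream(prompt), media_type="text/event-stream")
```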