deployment
11 concepts in this domain
-
API Gateway
System: A server that acts as an intermediary between clients and backend services, handling cross-cutting concerns like authentication, rate limiting, logging, and routing. For LLM applications, API gateways...
Also: Gateway, LLM Gateway, AI Gateway
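A minimal gateway sketch, assuming FastAPI and httpx; the backend URL, API key store, and per-key limits are illustrative placeholders rather than any particular product's API.

```python
import time
from collections import defaultdict

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
BACKEND_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical upstream model server
API_KEYS = {"demo-key"}                                    # hypothetical key store
_recent: dict[str, list[float]] = defaultdict(list)        # per-key request timestamps

def within_rate_limit(key: str, limit: int = 60, window: float = 60.0) -> bool:
    now = time.monotonic()
    _recent[key] = [t for t in _recent[key] if now - t < window]
    if len(_recent[key]) >= limit:
        return False
    _recent[key].append(now)
    return True

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    key = request.headers.get("x-api-key", "")
    if key not in API_KEYS:                     # authentication
        raise HTTPException(status_code=401, detail="invalid API key")
    if not within_rate_limit(key):              # rate limiting
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    body = await request.json()
    async with httpx.AsyncClient() as client:   # routing: forward to the backend service
        upstream = await client.post(BACKEND_URL, json=body, timeout=120.0)
    return upstream.json()
```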
-
Batch Inference
Process: Processing many inference requests together as a batch rather than one at a time. Batch inference optimizes for throughput and cost rather than latency—it's appropriate when you have many prompts to ...
Also: Batch Processing, Offline Inference, Bulk Inference
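A minimal batching sketch; `run_model` is a hypothetical stand-in for a real batched inference call (a single forward pass or a provider batch job), which is where the throughput and cost savings actually come from.

```python
from typing import Iterator

def chunked(prompts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size batches of prompts."""
    for i in range(0, len(prompts), batch_size):
        yield prompts[i:i + batch_size]

def run_model(batch: list[str]) -> list[str]:
    # Placeholder: a real implementation processes the whole batch at once,
    # amortizing model and per-request overhead across many prompts.
    return [f"response to: {p}" for p in batch]

def batch_inference(prompts: list[str], batch_size: int = 32) -> list[str]:
    results: list[str] = []
    for batch in chunked(prompts, batch_size):
        results.extend(run_model(batch))
    return results

if __name__ == "__main__":
    print(batch_inference(["summarize report A", "summarize report B"], batch_size=2))
```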
-
Caching
Process: Storing and reusing LLM responses for identical or similar requests to reduce latency and cost. Caching is particularly valuable for LLMs because inference is expensive and deterministic enough that r...
Also: Response Caching, Prompt Caching, LLM Caching
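A minimal exact-match cache sketch keyed on a hash of the prompt and generation parameters; `call_model` is a hypothetical placeholder for the real inference call, and similarity-based (semantic) caching would need an embedding lookup instead.

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(prompt: str, params: dict) -> str:
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_model(prompt: str, params: dict) -> str:
    return f"(placeholder) completion for: {prompt}"  # stand-in for the expensive call

def cached_completion(prompt: str, params: dict) -> str:
    key = cache_key(prompt, params)
    if key not in _cache:          # miss: pay for inference once
        _cache[key] = call_model(prompt, params)
    return _cache[key]             # hit: reuse the stored response

cached_completion("Summarize this policy.", {"temperature": 0})  # miss
cached_completion("Summarize this policy.", {"temperature": 0})  # hit, no model call
```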
-
Edge Deployment
Process: Running AI models on devices close to the user—phones, laptops, edge servers—rather than in centralized cloud data centers. Edge deployment reduces latency, enables offline operation, and keeps data ...
Also: On-Device AI, Edge AI, Local Inference
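A minimal on-device inference sketch, assuming the llama-cpp-python package and a locally stored GGUF model; the file path and parameters are illustrative.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/small-model.gguf",  # hypothetical local model file
    n_ctx=2048,                              # small context window to fit edge hardware
    n_threads=4,                             # match the device's CPU cores
)

# Inference runs entirely on the local machine: no network round trip,
# it works offline, and the prompt never leaves the device.
result = llm("Explain edge deployment in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```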
-
Inference Endpoint
Interface: An API endpoint that accepts inference requests and returns model predictions. Inference endpoints abstract away the complexity of model serving—clients send requests (prompts) and receive responses (...
Also: API Endpoint, Model Endpoint, Prediction Endpoint
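A minimal client sketch against an OpenAI-compatible chat completions endpoint; the URL, key, and model name are placeholders, and real endpoints differ in schema details.

```python
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"   # hypothetical endpoint URL

response = requests.post(
    ENDPOINT,
    headers={"Authorization": "Bearer sk-placeholder"},   # hypothetical credential
    json={
        "model": "my-deployed-model",                     # hypothetical model id
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```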
-
Load Balancing
Process: Distributing incoming requests across multiple servers or model replicas to optimize resource utilization, maximize throughput, and ensure high availability. For LLM serving, load balancing distribute...
Also: Traffic Distribution, Request Distribution
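A minimal selection sketch showing two common policies, round-robin and least-outstanding-requests, over a static replica list; the URLs are illustrative and the dispatch call is a placeholder.

```python
import itertools
from collections import Counter

REPLICAS = ["http://replica-1:8000", "http://replica-2:8000", "http://replica-3:8000"]

# Round-robin: cycle through replicas in a fixed order.
_rr = itertools.cycle(REPLICAS)
def pick_round_robin() -> str:
    return next(_rr)

# Least outstanding requests: prefer the replica with the fewest in-flight calls,
# which matters for LLMs because request durations vary widely.
_in_flight: Counter = Counter()
def pick_least_loaded() -> str:
    return min(REPLICAS, key=lambda r: _in_flight[r])

def dispatch(prompt: str) -> str:
    replica = pick_least_loaded()
    _in_flight[replica] += 1
    try:
        return f"(placeholder) would send {prompt!r} to {replica}"
    finally:
        _in_flight[replica] -= 1
```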
-
Model Router
System: A system that dynamically selects which model to use for a given request based on criteria like cost, latency, capability, or query complexity. Routers enable cost optimization (use cheaper models for...
Also: LLM Router, Intelligent Routing, Model Selection
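A minimal routing sketch: a crude heuristic sends short, simple queries to a cheaper model and everything else to a stronger one. The model names and the complexity rule are illustrative assumptions, not a recommended policy.

```python
def estimate_complexity(prompt: str) -> float:
    # Toy heuristic: longer prompts and reasoning-style keywords score higher.
    score = min(len(prompt) / 2000, 1.0)
    if any(k in prompt.lower() for k in ("prove", "analyze", "step by step")):
        score += 0.5
    return score

def route(prompt: str) -> str:
    return "small-cheap-model" if estimate_complexity(prompt) < 0.5 else "large-capable-model"

print(route("What's the capital of France?"))               # -> small-cheap-model
print(route("Analyze this contract clause step by step."))  # -> large-capable-model
```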
-
Model Serving
System: The infrastructure and systems that make trained models available for inference requests. Model serving handles loading models into memory, processing requests, managing resources, and returning predi...
Also: Model Deployment, Inference Serving, Model Hosting
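A minimal serving sketch, assuming FastAPI; `load_model` and `Model.generate` are hypothetical placeholders for a real backend (vLLM, TGI, Triton, and similar systems fill this role in practice). The point it illustrates is loading the model once at startup and reusing it across requests.

```python
from fastapi import FastAPI
from pydantic import BaseModel

class Model:
    def generate(self, prompt: str, max_tokens: int) -> str:
        return f"(placeholder) completion for: {prompt}"

def load_model() -> Model:
    # In a real server this loads weights into GPU/CPU memory once, at startup,
    # so every request reuses the same in-memory model.
    return Model()

app = FastAPI()
model = load_model()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
def generate(req: CompletionRequest) -> dict:
    return {"completion": model.generate(req.prompt, req.max_tokens)}
```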
-
Rate Limiting
Process: Restricting the number of requests a client can make to an API within a time window. Rate limiting protects services from abuse, ensures fair resource allocation, and maintains system stability. For L...
Also: Throttling, Request Limits, Quota Management
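A minimal token-bucket sketch that budgets tokens per minute rather than raw request counts, since LLM cost and load scale with token usage; the capacity and refill numbers are illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float = 10_000             # maximum tokens that can accumulate
    refill_per_sec: float = 10_000 / 60  # roughly 10k tokens per minute
    tokens: float = 10_000.0
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self, requested_tokens: int) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if requested_tokens <= self.tokens:
            self.tokens -= requested_tokens
            return True
        return False                     # caller would typically respond with HTTP 429

bucket = TokenBucket()
print(bucket.allow(2_000))   # True: within budget
print(bucket.allow(50_000))  # False: exceeds the remaining budget
```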
-
Sandbox
Runtime: An isolated execution environment where code can run without affecting the host system or accessing sensitive resources. In AI systems, sandboxes are used to safely execute AI-generated code, test u...
Also: Sandboxed Environment, Isolated Environment, Safe Execution Environment
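A minimal sketch of the interface a sandbox exposes: run untrusted code with a time limit and capture its output. A bare subprocess is not real isolation; production sandboxes add containers, VMs, or seccomp/gVisor-style confinement on top.

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_sec: float = 5.0) -> dict:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env vars and user site-packages
            capture_output=True,
            text=True,
            timeout=timeout_sec,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "timed out", "exit_code": -1}
    finally:
        os.unlink(path)

print(run_untrusted("print(sum(range(10)))"))
```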
-
Streaming
Process: Delivering LLM output incrementally as tokens are generated rather than waiting for the complete response. Streaming improves perceived latency—users see text appearing progressively instead of waitin...
Also: Token Streaming, Server-Sent Events, SSE
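A minimal server-side streaming sketch using FastAPI's StreamingResponse to emit tokens as Server-Sent Events; the token generator is a placeholder for a real model's incremental output.

```python
import time

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def fake_token_stream(prompt: str):
    # Placeholder: a real server yields tokens as the model generates them.
    for token in ["Streaming ", "sends ", "tokens ", "as ", "they ", "arrive."]:
        time.sleep(0.05)
        yield f"data: {token}\n\n"   # SSE framing: each event is a "data:" line plus a blank line
    yield "data: [DONE]\n\n"

@app.get("/stream")
def stream(prompt: str = ""):
    return StreamingResponse(fake_token_stream(prompt), media_type="text/event-stream")
```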