Load Balancing
Also known as: Traffic Distribution, Request Distribution
Definition
Distributing incoming requests across multiple servers or model replicas to optimize resource utilization, maximize throughput, and ensure high availability. For LLM serving, load balancing distributes inference requests across GPU instances to handle more concurrent users than a single server could support.
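A minimal sketch of the idea, assuming a client-side round-robin policy, hypothetical replica URLs, and an OpenAI-compatible `/v1/completions` endpoint (as vLLM exposes); the model name is a placeholder:

```python
import itertools

import requests  # assumed HTTP client; any would do

# Hypothetical replica endpoints -- substitute your own deployment's URLs.
REPLICAS = [
    "http://gpu-node-1:8000",
    "http://gpu-node-2:8000",
    "http://gpu-node-3:8000",
]

# Round-robin: each call hands out the next replica in the cycle.
_rotation = itertools.cycle(REPLICAS)


def send_inference_request(prompt: str) -> dict:
    """Send one inference request to the next replica in round-robin order."""
    replica = next(_rotation)
    response = requests.post(
        f"{replica}/v1/completions",  # OpenAI-compatible path served by vLLM
        json={"model": "my-model", "prompt": prompt, "max_tokens": 64},  # placeholder model name
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

In production this selection usually lives in a dedicated proxy or ingress rather than in the client, but the distribution logic is the same.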
What this is NOT
- Not model routing (load balancing distributes requests across identical replicas of the same model)
- Not rate limiting (load balancing is about distribution, not restriction)
- Not auto-scaling (though often used together)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Spreading LLM inference requests across multiple model replicas or instances. Distribution can happen within a cluster (across multiple GPUs) or across regions (geographic distribution).
Sources: vLLM multi-GPU documentation, Cloud load balancer documentation, Kubernetes ingress patterns
software-engineering
A technique to distribute network traffic across multiple servers, implemented via hardware appliances, software (nginx, HAProxy), or cloud services (AWS ALB, GCP Load Balancer).
Sources: Load balancing literature, Cloud provider documentation
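The heart of any of these implementations is a selection policy over a backend pool. A weighted-random sketch in Python (backend addresses and weights are invented for illustration); over many requests it approximates the traffic split that weighted round-robin schemes aim for:

```python
import random

# Hypothetical backend pool with relative capacity weights.
BACKENDS = {
    "10.0.0.11:8080": 3,  # bigger machine, should receive ~3x the traffic
    "10.0.0.12:8080": 1,
    "10.0.0.13:8080": 1,
}


def pick_backend() -> str:
    """Pick a backend with probability proportional to its weight."""
    servers = list(BACKENDS)
    weights = list(BACKENDS.values())
    return random.choices(servers, weights=weights, k=1)[0]
```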
Examples
- nginx load balancing across 4 vLLM replicas (a rough Python analogue is sketched after this list)
- Kubernetes distributing requests across model pods
- AWS ALB fronting multiple SageMaker endpoints
- Geographic load balancing to keep latency low for globally distributed users
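A rough Python analogue of the first example above, using a least-connections policy; the replica URLs and the policy choice are illustrative, not taken from any particular nginx configuration:

```python
import threading
from collections import defaultdict

# Four hypothetical vLLM replica endpoints.
REPLICAS = [
    "http://vllm-0:8000",
    "http://vllm-1:8000",
    "http://vllm-2:8000",
    "http://vllm-3:8000",
]

_lock = threading.Lock()
_in_flight = defaultdict(int)  # replica URL -> number of open requests


def acquire_replica() -> str:
    """Choose the replica with the fewest in-flight requests (least-connections)."""
    with _lock:
        replica = min(REPLICAS, key=lambda r: _in_flight[r])
        _in_flight[replica] += 1
        return replica


def release_replica(replica: str) -> None:
    """Call when a request finishes so the replica's load count drops."""
    with _lock:
        _in_flight[replica] -= 1
```

Least-connections tends to fit LLM inference better than plain round-robin because generation lengths, and therefore request durations, vary widely between requests.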
Counterexamples
Things that might seem like Load Balancing but are not:
- Single server handling all requests
- Model routing (selects among different models, not among replicas of the same model)
- Rate limiting (restricts the request rate rather than distributing requests)
Relations
- overlapsWith model-serving (Serving often includes load balancing)
- overlapsWith api-gateway (Gateways often perform load balancing)
- inTensionWith model-router (Router selects models; load balancer distributes to replicas)