Load Balancing
Also known as: Traffic Distribution, Request Distribution
Definition
Distributing incoming requests across multiple servers or model replicas to optimize resource utilization, maximize throughput, and ensure high availability. For LLM serving, load balancing distributes inference requests across GPU instances to handle more concurrent users than a single server could support.
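A minimal sketch of the idea, assuming a client-side round-robin policy, hypothetical replica URLs, and an OpenAI-compatible `/v1/completions` endpoint (as vLLM exposes); the model name is a placeholder:

```python
import itertools

import requests  # assumed HTTP client; any would do

# Hypothetical replica endpoints -- substitute your own deployment's URLs.
REPLICAS = [
    "http://gpu-node-1:8000",
    "http://gpu-node-2:8000",
    "http://gpu-node-3:8000",
]

# Round-robin: each call hands out the next replica in the cycle.
_rotation = itertools.cycle(REPLICAS)


def send_inference_request(prompt: str) -> dict:
    """Send one inference request to the next replica in round-robin order."""
    replica = next(_rotation)
    response = requests.post(
        f"{replica}/v1/completions",  # OpenAI-compatible path served by vLLM
        json={"model": "my-model", "prompt": prompt, "max_tokens": 64},  # placeholder model name
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

In production this selection usually lives in a dedicated proxy or ingress rather than in the client, but the distribution logic is the same.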
What this is NOT
- Not model routing (load balancing distributes requests across identical replicas of the same model)
- Not rate limiting (load balancing is about distribution, not restriction)
- Not auto-scaling (though often used together)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Spreading LLM inference requests across multiple model replicas or instances. Distribution can happen within a cluster (across multiple GPUs) or across regions (geographic distribution).
Sources: vLLM multi-GPU documentation, Cloud load balancer documentation, Kubernetes ingress patterns
software-engineering
A technique to distribute network traffic across multiple servers, implemented via hardware appliances, software (nginx, HAProxy), or cloud services (AWS ALB, GCP Load Balancer).
Sources: Load balancing literature, Cloud provider documentation
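The heart of any of these implementations is a selection policy over a backend pool. A weighted-random sketch in Python (backend addresses and weights are invented for illustration); over many requests it approximates the traffic split that weighted round-robin schemes aim for:

```python
import random

# Hypothetical backend pool with relative capacity weights.
BACKENDS = {
    "10.0.0.11:8080": 3,  # bigger machine, should receive ~3x the traffic
    "10.0.0.12:8080": 1,
    "10.0.0.13:8080": 1,
}


def pick_backend() -> str:
    """Pick a backend with probability proportional to its weight."""
    servers = list(BACKENDS)
    weights = list(BACKENDS.values())
    return random.choices(servers, weights=weights, k=1)[0]
```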
Examples
- nginx load balancing across 4 vLLM replicas (a rough Python analogue is sketched after this list)
- Kubernetes distributing requests across model pods
- AWS ALB fronting multiple SageMaker endpoints
- Geographic load balancing to keep latency low for globally distributed users
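A rough Python analogue of the first example above, using a least-connections policy; the replica URLs and the policy choice are illustrative, not taken from any particular nginx configuration:

```python
import threading
from collections import defaultdict

# Four hypothetical vLLM replica endpoints.
REPLICAS = [
    "http://vllm-0:8000",
    "http://vllm-1:8000",
    "http://vllm-2:8000",
    "http://vllm-3:8000",
]

_lock = threading.Lock()
_in_flight = defaultdict(int)  # replica URL -> number of open requests


def acquire_replica() -> str:
    """Choose the replica with the fewest in-flight requests (least-connections)."""
    with _lock:
        replica = min(REPLICAS, key=lambda r: _in_flight[r])
        _in_flight[replica] += 1
        return replica


def release_replica(replica: str) -> None:
    """Call when a request finishes so the replica's load count drops."""
    with _lock:
        _in_flight[replica] -= 1
```

Least-connections tends to fit LLM inference better than plain round-robin because generation lengths, and therefore request durations, vary widely between requests.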
Counterexamples
Things that might seem like Load Balancing but are not:
- Single server handling all requests
- Model routing (selects among different models, not among replicas of the same model)
- Rate limiting (restricts the request rate rather than distributing requests)
Relations
- overlapsWith model-serving (Serving often includes load balancing)
- overlapsWith api-gateway (Gateways often perform load balancing)
- inTensionWith model-router (Router selects models; load balancer distributes to replicas)