Rate Limiting
Also known as: Throttling, Request Limits, Quota Management
Definition
Restricting the number of requests a client can make to an API within a time window. Rate limiting protects services from abuse, ensures fair resource allocation, and maintains system stability. For LLM APIs, limits are often expressed in requests per minute (RPM) and tokens per minute (TPM).
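As a rough sketch of what enforcing both budgets could look like, here is a fixed-window counter in Python; the limiter class, the one-minute window, and the limit values are illustrative assumptions, not any provider's actual implementation:

```python
import time

class FixedWindowLimiter:
    """Illustrative limiter tracking requests and tokens in one-minute windows."""

    def __init__(self, rpm_limit: int, tpm_limit: int) -> None:
        self.rpm_limit = rpm_limit      # requests allowed per minute
        self.tpm_limit = tpm_limit      # tokens allowed per minute
        self.window_start = time.monotonic()
        self.requests = 0
        self.tokens = 0

    def allow(self, tokens: int) -> bool:
        """Return True if a request costing `tokens` fits in the current window."""
        now = time.monotonic()
        if now - self.window_start >= 60:
            # A new minute has started: reset both counters.
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        if self.requests + 1 > self.rpm_limit or self.tokens + tokens > self.tpm_limit:
            return False  # caller would respond with HTTP 429
        self.requests += 1
        self.tokens += tokens
        return True

# Example limits (assumed, per-tier values vary by provider).
limiter = FixedWindowLimiter(rpm_limit=500, tpm_limit=30_000)
```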
What this is NOT
- Not cost limits (rate limits are about throughput, not spend)
- Not authentication (rate limiting happens after auth)
- Not load balancing (though related to capacity management)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The quotas enforced by LLM API providers that cap how many requests and tokens you can consume. Exceeding a limit returns an HTTP 429 error, which requires backing off and retrying or upgrading to a higher tier.
Sources: OpenAI Rate Limits documentation, Anthropic Usage Limits, API provider documentation
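A minimal client-side sketch of the 429-and-retry loop described above, using the third-party `requests` library; the endpoint URL and payload are placeholders, and it assumes `Retry-After` carries a delay in seconds rather than an HTTP-date:

```python
import time
import requests  # third-party: pip install requests

def post_with_retry(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """POST, sleeping for the server-suggested Retry-After on HTTP 429."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        # Retry-After is a standard HTTP header; assume seconds, default to 1.
        delay = float(resp.headers.get("Retry-After", 1))
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```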
Examples
- OpenAI Tier 1: 500 RPM, 30,000 TPM for GPT-4
- 429 Too Many Requests error with Retry-After header
- Implementing exponential backoff: wait 1s, 2s, 4s, 8s... (see the first sketch after this list)
- Token bucket algorithm for client-side rate limiting (see the second sketch after this list)
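First, a sketch of the exponential backoff schedule from the list above (1s, 2s, 4s, 8s), with jitter added; `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on a 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 exception your client library raises."""

def backoff_delays(base: float = 1.0, factor: float = 2.0, max_retries: int = 4):
    """Yield the 1s, 2s, 4s, 8s delays from the list above, with full jitter."""
    for attempt in range(max_retries):
        delay = base * factor ** attempt
        # Jitter spreads retries so many clients don't hammer the API in lockstep.
        yield random.uniform(0, delay)

def call_with_backoff(fn):
    """Call `fn`, sleeping through the backoff schedule on each rate-limit error."""
    for delay in backoff_delays():
        try:
            return fn()
        except RateLimitError:
            time.sleep(delay)
    return fn()  # final attempt; a further RateLimitError propagates to the caller
```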
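Second, a sketch of a client-side token bucket, as mentioned in the last item; the rate and capacity values are illustrative, sized against the assumed 500 RPM figure above:

```python
import time

class TokenBucket:
    """Client-side token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((cost - self.tokens) / self.rate)

# Example: stay under an assumed 500 RPM limit (~8.3 requests/sec).
bucket = TokenBucket(rate=500 / 60, capacity=10)
```

A small capacity lets the client burst briefly while the refill rate holds the long-run request rate at the limit.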
Counterexamples
Things that might seem like Rate Limiting but are not:
- Cost limits or spending caps (different concern)
- Authentication failures (401, not 429)
- Model capability limits (different type of limitation)
Relations
- overlapsWith inference-endpoint (Endpoints enforce rate limits)
- overlapsWith api-gateway (Gateways often implement rate limiting)
- overlapsWith model-serving (Serving systems implement rate limiting)
Implementations
Tools and frameworks that implement this concept:
- Helicone