Rate Limiting
Also known as: Throttling, Request Limits, Quota Management
Definition
Restricting the number of requests a client can make to an API within a time window. Rate limiting protects services from abuse, ensures fair resource allocation, and maintains system stability. For LLM APIs, limits are often expressed in requests per minute (RPM) and tokens per minute (TPM).
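As a rough sketch of what enforcing both budgets could look like, here is a fixed-window counter in Python; the limiter class, the one-minute window, and the limit values are illustrative assumptions, not any provider's actual implementation:

```python
import time

class FixedWindowLimiter:
    """Illustrative limiter tracking requests and tokens in one-minute windows."""

    def __init__(self, rpm_limit: int, tpm_limit: int) -> None:
        self.rpm_limit = rpm_limit      # requests allowed per minute
        self.tpm_limit = tpm_limit      # tokens allowed per minute
        self.window_start = time.monotonic()
        self.requests = 0
        self.tokens = 0

    def allow(self, tokens: int) -> bool:
        """Return True if a request costing `tokens` fits in the current window."""
        now = time.monotonic()
        if now - self.window_start >= 60:
            # A new minute has started: reset both counters.
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        if self.requests + 1 > self.rpm_limit or self.tokens + tokens > self.tpm_limit:
            return False  # caller would respond with HTTP 429
        self.requests += 1
        self.tokens += tokens
        return True

# Example limits (assumed, per-tier values vary by provider).
limiter = FixedWindowLimiter(rpm_limit=500, tpm_limit=30_000)
```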
What this is NOT
- Not cost limits (rate limits are about throughput, not spend)
- Not authentication (rate limiting happens after auth)
- Not load balancing (though related to capacity management)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The quotas enforced by LLM API providers that cap how many requests and tokens you can consume. Exceeding a limit returns an HTTP 429 error, which requires backing off and retrying or upgrading to a higher tier.
Sources: OpenAI Rate Limits documentation, Anthropic Usage Limits, API provider documentation
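A minimal client-side sketch of the 429-and-retry loop described above, using the third-party `requests` library; the endpoint URL and payload are placeholders, and it assumes `Retry-After` carries a delay in seconds rather than an HTTP-date:

```python
import time
import requests  # third-party: pip install requests

def post_with_retry(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """POST, sleeping for the server-suggested Retry-After on HTTP 429."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload)
        if resp.status_code != 429:
            return resp
        # Retry-After is a standard HTTP header; assume seconds, default to 1.
        delay = float(resp.headers.get("Retry-After", 1))
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```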
Examples
- OpenAI Tier 1: 500 RPM, 30,000 TPM for GPT-4
- 429 Too Many Requests error with Retry-After header
- Implementing exponential backoff: wait 1s, 2s, 4s, 8s... (see the first sketch after this list)
- Token bucket algorithm for client-side rate limiting (see the second sketch after this list)
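First, a sketch of the exponential backoff schedule from the list above (1s, 2s, 4s, 8s), with jitter added; `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on a 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 exception your client library raises."""

def backoff_delays(base: float = 1.0, factor: float = 2.0, max_retries: int = 4):
    """Yield the 1s, 2s, 4s, 8s delays from the list above, with full jitter."""
    for attempt in range(max_retries):
        delay = base * factor ** attempt
        # Jitter spreads retries so many clients don't hammer the API in lockstep.
        yield random.uniform(0, delay)

def call_with_backoff(fn):
    """Call `fn`, sleeping through the backoff schedule on each rate-limit error."""
    for delay in backoff_delays():
        try:
            return fn()
        except RateLimitError:
            time.sleep(delay)
    return fn()  # final attempt; a further RateLimitError propagates to the caller
```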
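Second, a sketch of a client-side token bucket, as mentioned in the last item; the rate and capacity values are illustrative, sized against the assumed 500 RPM figure above:

```python
import time

class TokenBucket:
    """Client-side token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, cost: float = 1.0) -> None:
        """Block until `cost` tokens are available, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((cost - self.tokens) / self.rate)

# Example: stay under an assumed 500 RPM limit (~8.3 requests/sec).
bucket = TokenBucket(rate=500 / 60, capacity=10)
```

A small capacity lets the client burst briefly while the refill rate holds the long-run request rate at the limit.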
Counterexamples
Things that might seem like Rate Limiting but are not:
- Cost limits or spending caps (different concern)
- Authentication failures (401, not 429)
- Model capability limits (different type of limitation)
Relations
- overlapsWith inference-endpoint (Endpoints enforce rate limits)
- overlapsWith api-gateway (Gateways often implement rate limiting)
- overlapsWith model-serving (Serving systems implement rate limiting)
Implementations
Tools and frameworks that implement this concept:
- Helicone