Caching
Also known as: Response Caching, Prompt Caching, LLM Caching
Definition
Storing and reusing LLM responses for identical or similar requests to reduce latency and cost. Caching is particularly valuable for LLMs because inference is expensive, and with fixed parameters (notably temperature=0) repeated queries are deterministic enough that previous results can often be reused. Different caching strategies apply at different levels, from exact-match lookups to semantic matching and provider-side prefix caching.
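As a minimal sketch of the exact-match flavor, responses can be keyed by a hash of the prompt plus the generation parameters. The `call_llm` function here is a hypothetical stand-in for any provider client, not a specific API:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def call_llm(prompt: str, **params) -> str:
    """Hypothetical placeholder for a real provider call."""
    raise NotImplementedError

def cached_completion(prompt: str, **params) -> str:
    # Key on the prompt plus every generation parameter, so that e.g.
    # different temperatures never share a cache entry.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt, **params)  # cache miss: call the model once
    return _cache[key]
```

Production setups typically swap the in-memory dict for a shared store such as Redis, which is the prompt-hash → response pattern listed under Examples below.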
What this is NOT
- Not the KV-cache inside a single inference pass (that's internal to the serving engine; this is application-level)
- Not model weights caching (that's loading, not inference results)
- Not embedding caching specifically (though related)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Saving LLM responses keyed by the prompt (and parameters) so that identical future requests return the cached result instantly without calling the model. Also includes provider- or server-side KV-cache reuse for shared prompt prefixes across requests (prefix/prompt caching).
Sources: GPTCache documentation, Anthropic Prompt Caching, LLM caching patterns
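Provider-side prefix caching is configured per request rather than in application code. The sketch below follows Anthropic's published Prompt Caching usage, marking a long system prompt with a `cache_control` block so its processed prefix can be reused by later calls; the model name and prompt text are placeholders, not prescriptions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # placeholder: a large, stable instruction block

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this prefix as cacheable so later requests sharing it
            # skip re-processing those tokens.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the attached policy."}],
)
```

Note that this reuses the processed prefix on the provider side; it does not return a cached final response, so it complements rather than replaces application-level response caching.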
Examples
- GPTCache caching responses with semantic similarity matching (the pattern is sketched after this list)
- Anthropic Prompt Caching for long system prompts
- Redis cache storing prompt-hash → response mappings
- Exact match cache with temperature=0 for determinism
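A minimal sketch of semantic matching, the idea behind GPTCache-style caches: a new prompt hits the cache if its embedding is close enough to a previously seen one. The `embed` and `call_llm` helpers are hypothetical placeholders for any embedding model and LLM client:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: return a unit-norm embedding of the text."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for the underlying model call."""
    raise NotImplementedError

class SemanticCache:
    """Return a cached response when a new prompt is close enough in embedding space."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold            # cosine-similarity cutoff; tune per workload
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def get(self, prompt: str) -> str:
        query = embed(prompt)
        if self.vectors:
            sims = np.stack(self.vectors) @ query  # cosine similarity (unit-norm vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]        # hit: reuse response for a similar prompt
        answer = call_llm(prompt)                  # miss: call the model and store the result
        self.vectors.append(query)
        self.responses.append(answer)
        return answer
```

The similarity threshold trades hit rate against the risk of returning a mismatched answer, which is the main tuning decision in semantic caching.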
Counterexamples
Things that might seem like Caching but are not:
- Every request hitting the model (no caching)
- KV-cache within a single inference (internal, not application caching)
- Caching model weights on GPU (different concern)
Relations
- overlapsWith api-gateway (Gateways often implement caching)
- overlapsWith model-serving (Serving can include caching layers)
- overlapsWith context-window (Prefix caching relates to context reuse)
Implementations
Tools and frameworks that implement this concept: