Caching
Also known as: Response Caching, Prompt Caching, LLM Caching
Definition
Storing and reusing LLM responses for identical or similar requests to reduce latency and cost. Caching is particularly valuable for LLMs because inference is expensive, and with fixed parameters (notably temperature=0) repeated queries are deterministic enough that previous results can often be reused. Different caching strategies apply at different levels, from exact-match lookups to semantic matching and provider-side prefix caching.
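As a minimal sketch of the exact-match flavor, responses can be keyed by a hash of the prompt plus the generation parameters. The `call_llm` function here is a hypothetical stand-in for any provider client, not a specific API:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def call_llm(prompt: str, **params) -> str:
    """Hypothetical placeholder for a real provider call."""
    raise NotImplementedError

def cached_completion(prompt: str, **params) -> str:
    # Key on the prompt plus every generation parameter, so that e.g.
    # different temperatures never share a cache entry.
    key = hashlib.sha256(
        json.dumps({"prompt": prompt, **params}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt, **params)  # cache miss: call the model once
    return _cache[key]
```

Production setups typically swap the in-memory dict for a shared store such as Redis, which is the prompt-hash → response pattern listed under Examples below.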
What this is NOT
- Not the KV-cache inside a single inference pass (that's internal to the serving engine; this is application-level)
- Not model weights caching (that's loading, not inference results)
- Not embedding caching specifically (though related)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Saving LLM responses keyed by the prompt (and parameters) so that identical future requests return the cached result instantly without calling the model. Also includes provider- or server-side KV-cache reuse for shared prompt prefixes across requests (prefix/prompt caching).
Sources: GPTCache documentation, Anthropic Prompt Caching, LLM caching patterns
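Provider-side prefix caching is configured per request rather than in application code. The sketch below follows Anthropic's published Prompt Caching usage, marking a long system prompt with a `cache_control` block so its processed prefix can be reused by later calls; the model name and prompt text are placeholders, not prescriptions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # placeholder: a large, stable instruction block

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this prefix as cacheable so later requests sharing it
            # skip re-processing those tokens.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the attached policy."}],
)
```

Note that this reuses the processed prefix on the provider side; it does not return a cached final response, so it complements rather than replaces application-level response caching.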
Examples
- GPTCache caching responses with semantic similarity matching (the pattern is sketched after this list)
- Anthropic Prompt Caching for long system prompts
- Redis cache storing prompt-hash → response mappings
- Exact match cache with temperature=0 for determinism
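A minimal sketch of semantic matching, the idea behind GPTCache-style caches: a new prompt hits the cache if its embedding is close enough to a previously seen one. The `embed` and `call_llm` helpers are hypothetical placeholders for any embedding model and LLM client:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical placeholder: return a unit-norm embedding of the text."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for the underlying model call."""
    raise NotImplementedError

class SemanticCache:
    """Return a cached response when a new prompt is close enough in embedding space."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold            # cosine-similarity cutoff; tune per workload
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    def get(self, prompt: str) -> str:
        query = embed(prompt)
        if self.vectors:
            sims = np.stack(self.vectors) @ query  # cosine similarity (unit-norm vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]        # hit: reuse response for a similar prompt
        answer = call_llm(prompt)                  # miss: call the model and store the result
        self.vectors.append(query)
        self.responses.append(answer)
        return answer
```

The similarity threshold trades hit rate against the risk of returning a mismatched answer, which is the main tuning decision in semantic caching.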
Counterexamples
Things that might seem like Caching but are not:
- Every request hitting the model (no caching)
- KV-cache within a single inference (internal, not application caching)
- Caching model weights on GPU (different concern)
Relations
- overlapsWith api-gateway (Gateways often implement caching)
- overlapsWith model-serving (Serving can include caching layers)
- overlapsWith context-window (Prefix caching relates to context reuse)
Implementations
Tools and frameworks that implement this concept: