Retrieval-Augmented Generation
Also known as: RAG, Retrieval Augmentation
Definition
A pattern that enhances LLM generation by first retrieving relevant documents from an external knowledge source, then including those documents in the prompt as context. RAG addresses several LLM limitations: knowledge cutoffs, hallucination, and lack of access to private data. The retrieval step grounds the model's response in actual source material.
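A minimal sketch of the pattern in Python. The corpus, the keyword scorer, and the generate() stub are illustrative stand-ins, not any particular library's API:

```python
# Minimal RAG sketch: retrieve, augment the prompt, generate.
# CORPUS, retrieve(), and generate() are hypothetical stand-ins.

CORPUS = [
    "The refund window is 30 days from the date of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(CORPUS,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an API request)."""
    return f"[LLM response grounded in a prompt of {len(prompt)} chars]"

def rag_answer(query: str) -> str:
    # Augment: put the retrieved documents into the prompt as context.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(rag_answer("What is the refund window?"))
```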
What this is NOT
- Not fine-tuning (RAG doesn't modify model weights)
- Not the same as search (RAG uses search as a component but adds a generation step)
- Not agentic by default (basic RAG is a fixed pipeline, not a decision loop)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
A pipeline pattern: query → retrieve documents → augment prompt with retrieved context → generate response. Implementations range from simple (retrieve top-k, concatenate) to complex (multi-step retrieval, reranking, query rewriting).
Sources: RAG paper (Lewis et al., 2020), LangChain RAG documentation, LlamaIndex documentation
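The complex variants are extra stages in the same pipeline. A hedged sketch, where rewrite_query, first_stage, and rerank are toy stand-ins for an LLM rewriter, BM25 or vector search, and a cross-encoder reranker:

```python
# Sketch of a complex RAG pipeline: query rewriting, wide first-stage
# retrieval, then reranking. All components are toy stand-ins.

DOCS = [
    "RAG grounds generation in retrieved documents.",
    "Fine-tuning updates model weights on domain data.",
    "Rerankers score each query-document pair individually.",
    "Vector search finds nearest neighbours in embedding space.",
]

def rewrite_query(q: str) -> str:
    # Stand-in for an LLM rewrite step; here we only normalise the text.
    return q.lower().rstrip("?")

def overlap(q: str, d: str) -> int:
    return len(set(q.split()) & set(d.lower().split()))

def first_stage(q: str, k: int = 4) -> list[str]:
    # High-recall stage: keyword overlap as a proxy for BM25/vector search.
    return sorted(DOCS, key=lambda d: overlap(q, d), reverse=True)[:k]

def rerank(q: str, docs: list[str], n: int = 2) -> list[str]:
    # High-precision stage: a real reranker scores each (query, doc) pair
    # with a cross-encoder; overlap density is a crude proxy.
    return sorted(docs, key=lambda d: overlap(q, d) / len(d.split()),
                  reverse=True)[:n]

def build_prompt(query: str) -> str:
    q = rewrite_query(query)
    context = "\n".join(rerank(q, first_stage(q)))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does vector search work?"))
```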
academic-nlp
A class of models that combine parametric memory (model weights) with non-parametric memory (retrieval index) to improve knowledge-intensive NLP tasks.
Sources: RAG paper (Lewis et al., 2020), REALM, RETRO papers
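The combination can be stated as a sketch of the RAG-Sequence formulation from Lewis et al. (2020), where p_η is the retriever over the non-parametric index and p_θ is the parametric generator:

$$ p(y \mid x) \;\approx\; \sum_{z \,\in\, \mathrm{top}\text{-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z) $$

The response distribution marginalises over the top-k retrieved passages z, so retrieval quality directly bounds generation quality.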
Examples
- Customer support bot that retrieves relevant documentation before answering
- Legal research tool that finds relevant case law and summarizes it
- Enterprise Q&A over internal wikis and documents
- Chatbot that cites sources for its claims
Counterexamples
Things that might seem like Retrieval-Augmented Generation but are not:
- A chatbot that only uses its parametric knowledge (no retrieval)
- A search engine that returns documents without generation
- Fine-tuning a model on domain data (modifies weights, not retrieval)
Relations
- requires embedding (Documents are typically embedded for retrieval)
- requires vector-search (Retrieval often uses vector similarity search)
- overlapsWith knowledge-base (RAG retrieves from a knowledge base)
- overlapsWith chunking (Documents are chunked before indexing; see the sketch after this list)
- inTensionWith fine-tuning (Alternative approaches to adding knowledge to models)
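The embedding, vector-search, and chunking relations compose into the indexing path of a RAG system. A toy numpy sketch, where chunk() and embed() are hypothetical stand-ins for a real splitter and embedding model:

```python
# Toy sketch of the indexing and retrieval steps behind RAG:
# chunk -> embed -> vector search by cosine similarity.
import numpy as np

def chunk(text: str, size: int = 8) -> list[str]:
    """Split a document into fixed-size word windows (naive chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size unit vector."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# Index: embed every chunk once, ahead of query time.
doc = ("RAG retrieves relevant chunks from a knowledge base "
       "and feeds them to the model as context")
chunks = chunk(doc)
index = np.stack([embed(c) for c in chunks])

# Query: on unit vectors, cosine similarity reduces to a dot product.
q = embed("which chunks are relevant to the query")
best = chunks[int(np.argmax(index @ q))]
print(best)
```

Real systems store the chunk embeddings in a vector database rather than a numpy array, but the retrieval step is the same nearest-neighbour search.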
Implementations
Tools and frameworks that implement this concept:
- Amazon Bedrock (secondary)
- Claude Chrome Extension (secondary)
- Cohere (secondary)
- Context7 (primary)
- Cursor (secondary)
- Dify (primary)
- Flowise (primary)
- Google Cloud Vertex AI (secondary)
- Haystack (primary)
- LangChain (primary)
- LlamaIndex (primary)