Retrieval-Augmented Generation
Also known as: RAG, Retrieval Augmentation
Definition
A pattern that enhances LLM generation by first retrieving relevant documents from an external knowledge source, then including those documents in the prompt as context. RAG addresses several LLM limitations: knowledge cutoffs, hallucination, and lack of access to private data. The retrieval step grounds the model's response in actual source material.
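A minimal sketch of the pattern in Python. The corpus, the keyword scorer, and the generate() stub are illustrative stand-ins, not any particular library's API:

```python
# Minimal RAG sketch: retrieve, augment the prompt, generate.
# CORPUS, retrieve(), and generate() are hypothetical stand-ins.

CORPUS = [
    "The refund window is 30 days from the date of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(CORPUS,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an API request)."""
    return f"[LLM response grounded in a prompt of {len(prompt)} chars]"

def rag_answer(query: str) -> str:
    # Augment: put the retrieved documents into the prompt as context.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(rag_answer("What is the refund window?"))
```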
What this is NOT
- Not fine-tuning (RAG doesn't modify model weights)
- Not the same as search (RAG uses search as a component but adds a generation step)
- Not agentic by default (basic RAG is a fixed pipeline, not a decision loop)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
A pipeline pattern: query → retrieve documents → augment prompt with retrieved context → generate response. Implementations range from simple (retrieve top-k, concatenate) to complex (multi-step retrieval, reranking, query rewriting).
Sources: RAG paper (Lewis et al., 2020), LangChain RAG documentation, LlamaIndex documentation
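The complex variants are extra stages in the same pipeline. A hedged sketch, where rewrite_query, first_stage, and rerank are toy stand-ins for an LLM rewriter, BM25 or vector search, and a cross-encoder reranker:

```python
# Sketch of a complex RAG pipeline: query rewriting, wide first-stage
# retrieval, then reranking. All components are toy stand-ins.

DOCS = [
    "RAG grounds generation in retrieved documents.",
    "Fine-tuning updates model weights on domain data.",
    "Rerankers score each query-document pair individually.",
    "Vector search finds nearest neighbours in embedding space.",
]

def rewrite_query(q: str) -> str:
    # Stand-in for an LLM rewrite step; here we only normalise the text.
    return q.lower().rstrip("?")

def overlap(q: str, d: str) -> int:
    return len(set(q.split()) & set(d.lower().split()))

def first_stage(q: str, k: int = 4) -> list[str]:
    # High-recall stage: keyword overlap as a proxy for BM25/vector search.
    return sorted(DOCS, key=lambda d: overlap(q, d), reverse=True)[:k]

def rerank(q: str, docs: list[str], n: int = 2) -> list[str]:
    # High-precision stage: a real reranker scores each (query, doc) pair
    # with a cross-encoder; overlap density is a crude proxy.
    return sorted(docs, key=lambda d: overlap(q, d) / len(d.split()),
                  reverse=True)[:n]

def build_prompt(query: str) -> str:
    q = rewrite_query(query)
    context = "\n".join(rerank(q, first_stage(q)))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How does vector search work?"))
```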
academic-nlp
A class of models that combine parametric memory (model weights) with non-parametric memory (retrieval index) to improve knowledge-intensive NLP tasks.
Sources: RAG paper (Lewis et al., 2020), REALM, RETRO papers
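The combination can be stated as a sketch of the RAG-Sequence formulation from Lewis et al. (2020), where p_η is the retriever over the non-parametric index and p_θ is the parametric generator:

$$ p(y \mid x) \;\approx\; \sum_{z \,\in\, \mathrm{top}\text{-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z) $$

The response distribution marginalises over the top-k retrieved passages z, so retrieval quality directly bounds generation quality.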
Examples
- Customer support bot that retrieves relevant documentation before answering
- Legal research tool that finds relevant case law and summarizes it
- Enterprise Q&A over internal wikis and documents
- Chatbot that cites sources for its claims
Counterexamples
Things that might seem like Retrieval-Augmented Generation but are not:
- A chatbot that only uses its parametric knowledge (no retrieval)
- A search engine that returns documents without generation
- Fine-tuning a model on domain data (modifies weights, not retrieval)
Relations
- requires embedding (Documents are typically embedded for retrieval)
- requires vector-search (Retrieval often uses vector similarity search)
- overlapsWith knowledge-base (RAG retrieves from a knowledge base)
- overlapsWith chunking (Documents are chunked before indexing; see the sketch after this list)
- inTensionWith fine-tuning (Alternative approaches to adding knowledge to models)
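The embedding, vector-search, and chunking relations compose into the indexing path of a RAG system. A toy numpy sketch, where chunk() and embed() are hypothetical stand-ins for a real splitter and embedding model:

```python
# Toy sketch of the indexing and retrieval steps behind RAG:
# chunk -> embed -> vector search by cosine similarity.
import numpy as np

def chunk(text: str, size: int = 8) -> list[str]:
    """Split a document into fixed-size word windows (naive chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size unit vector."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# Index: embed every chunk once, ahead of query time.
doc = ("RAG retrieves relevant chunks from a knowledge base "
       "and feeds them to the model as context")
chunks = chunk(doc)
index = np.stack([embed(c) for c in chunks])

# Query: on unit vectors, cosine similarity reduces to a dot product.
q = embed("which chunks are relevant to the query")
best = chunks[int(np.argmax(index @ q))]
print(best)
```

Real systems store the chunk embeddings in a vector database rather than a numpy array, but the retrieval step is the same nearest-neighbour search.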
Implementations
Tools and frameworks that implement this concept:
- Amazon Bedrock (secondary)
- Claude Chrome Extension (secondary)
- Cohere (secondary)
- Context7 (primary)
- Cursor (secondary)
- Dify (primary)
- Flowise (primary)
- Google Cloud Vertex AI (secondary)
- Haystack (primary)
- LangChain (primary)
- LlamaIndex (primary)