Retrieval-Augmented Generation

process · retrieval · published

Also known as: RAG, Retrieval Augmentation

Definition

A pattern that enhances LLM generation by first retrieving relevant documents from an external knowledge source, then including those documents in the prompt as context. RAG addresses LLM limitations: knowledge cutoffs, hallucination, and lack of access to private data. The retrieval step grounds the model's response in actual source material.
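The "augment" step described above can be sketched as a prompt template that splices retrieved passages in as numbered sources, so the answer is grounded in (and can cite) the source material. The function name and template wording are illustrative assumptions, not a fixed API:

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    # Number each retrieved passage so the model can cite it as [1], [2], ...
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the sources below, "
        "citing them as [1], [2], ...\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

The resulting string is what gets sent to the LLM in place of the bare question.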

What this is NOT

  • Not fine-tuning (RAG doesn't modify model weights)
  • Not the same as search (RAG uses search but includes generation)
  • Not agentic by default (basic RAG is a fixed pipeline, not a decision loop)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

A pipeline pattern: query → retrieve documents → augment prompt with retrieved context → generate response. Implementations range from simple (retrieve top-k, concatenate) to complex (multi-step retrieval, reranking, query rewriting).
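The simple end of that range (retrieve top-k, concatenate) can be sketched as follows. The bag-of-words "embedding" here is a toy stand-in for a real embedding model, and the document texts are invented examples:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lowercase word counts (real systems use dense vectors).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank all documents by similarity to the query; keep the top-k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
top = retrieve("How do I get a refund?", docs)
```

The complex variants mentioned above (multi-step retrieval, reranking, query rewriting) add stages before or after this core ranking step.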

Sources: RAG paper (Lewis et al., 2020), LangChain RAG documentation, LlamaIndex documentation

academic-nlp

A class of models that combine parametric memory (model weights) with non-parametric memory (retrieval index) to improve knowledge-intensive NLP tasks.

Sources: RAG paper (Lewis et al., 2020), REALM, RETRO papers

Examples

  • Customer support bot that retrieves relevant documentation before answering
  • Legal research tool that finds relevant case law and summarizes it
  • Enterprise Q&A over internal wikis and documents
  • Chatbot that cites sources for its claims

Counterexamples

Things that might seem like Retrieval-Augmented Generation but are not:

  • A chatbot that only uses its parametric knowledge (no retrieval)
  • A search engine that returns documents without generation
  • Fine-tuning a model on domain data (modifies weights, not retrieval)

Relations

  • requires embedding (Documents are typically embedded for retrieval)
  • requires vector-search (Retrieval often uses vector similarity search)
  • overlapsWith knowledge-base (RAG retrieves from a knowledge base)
  • overlapsWith chunking (Documents are chunked before indexing)
  • inTensionWith fine-tuning (Alternative approaches to adding knowledge to models)
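The chunking relation above can be illustrated with a minimal fixed-size chunker with overlap, run over documents before indexing. The sizes are illustrative assumptions; real pipelines often chunk by tokens or by sentence and section boundaries instead of raw characters:

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Slide a window of `size` characters, stepping by `size - overlap`
    # so consecutive chunks share `overlap` characters of context.
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks
```

Each chunk is then embedded and stored in the vector index; overlap reduces the chance that a relevant sentence is split across a chunk boundary.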

Implementations

Tools and frameworks that implement this concept:

  • LangChain (RAG chains and retrievers)
  • LlamaIndex (document indexing and query engines)