Chunking
Also known as: Text Chunking, Document Chunking, Splitting
Definition
The process of dividing documents into smaller pieces (chunks) for embedding and retrieval. Chunking is necessary because: (1) embedding models have input limits, (2) smaller chunks enable more precise retrieval, (3) LLM context windows are finite. Chunking strategy significantly impacts RAG quality: poor chunking produces irrelevant or incomplete retrievals.
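The simplest form of the process above is fixed-size chunking with overlap. A minimal sketch, using word counts as a stand-in for tokens (real pipelines would count tokens with the embedding model's tokenizer); the function name and defaults are illustrative, not from any library:

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of `chunk_size` words, where consecutive
    chunks share `overlap` words so context isn't cut mid-thought."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.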
What this is NOT
- Not tokenization (tokenization is subword splitting for the model; chunking is document splitting for retrieval)
- Not summarization (chunks preserve original text; summaries compress it)
- Not just splitting on newlines (effective chunking considers semantics)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Splitting documents into segments suitable for embedding, typically 100-1000 tokens each. Strategies include: fixed-size, sentence-based, paragraph-based, semantic (by topic), and recursive (hierarchical).
Sources: LangChain text splitters documentation, LlamaIndex node parsers documentation
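The recursive (hierarchical) strategy listed above splits on the coarsest boundary first (paragraphs), then recurses into finer ones (lines, sentences) only for pieces that are still too large. A minimal sketch under that idea, not the LangChain or LlamaIndex implementation; the separator order and character-based limit are assumptions:

```python
def recursive_chunk(text: str, max_len: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ")) -> list[str]:
    """Recursively split `text` on the coarsest separator that produces
    multiple pieces, until every chunk fits within `max_len` characters."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_chunk(part, max_len, separators))
            return chunks
    # No separator applies: fall back to a hard character cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraphs are tried before sentences, semantically related text stays together whenever it fits, which is the main advantage over purely fixed-size splitting.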
Examples
- Splitting a PDF into 500-token chunks with 50-token overlap
- Chunking code by function/class boundaries
- Semantic chunking that keeps related paragraphs together
- Markdown-aware chunking that respects header hierarchy
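The last example, markdown-aware chunking, can be sketched by splitting at heading lines so each chunk carries its section header as context. A minimal illustration, assuming ATX-style (`#`) headings only; the function name and return shape are hypothetical:

```python
import re

def chunk_by_headers(md: str) -> list[tuple[str, str]]:
    """Split markdown into (header, body) sections at each heading line.
    Text before the first heading is returned with an empty header."""
    sections: list[tuple[str, str]] = []
    header, body = "", []
    for line in md.splitlines():
        if re.match(r"#{1,6}\s", line):  # ATX heading starts a new section
            if header or body:
                sections.append((header, "\n".join(body).strip()))
            header, body = line.strip(), []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    return sections
```

Keeping the header with each chunk helps retrieval, since a query about "installation" can match the section title even when the body never repeats the word.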
Counterexamples
Things that might seem like Chunking but are not:
- Storing entire documents without splitting (whole documents often exceed context limits and make retrieval too coarse)
- Tokenizing text into individual tokens (too granular)
- Summarizing documents (loses detail)
Relations
- requires embedding (Chunks are embedded for retrieval)
- requires retrieval-augmented-generation (Chunking is a preprocessing step for RAG)
- overlapsWith knowledge-base (Knowledge bases store chunks)
Implementations
Tools and frameworks that implement this concept:
- LlamaIndex (primary)