Tokenizer
Also known as: Tokenization, Token Encoder
Definition
A component that converts raw text into tokens (numerical IDs) that an LLM can process, and converts tokens back to text. Tokenizers define the vocabulary of a model and how text is segmented. The tokenization scheme affects model behavior, efficiency, and multilingual capabilities.
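A minimal round-trip sketch (assuming the `tiktoken` package is installed; any of the tokenizers listed further down would work similarly):

```python
# Minimal encode/decode round trip with tiktoken (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

ids = enc.encode("Tokenizers map text to integer IDs.")
print(ids)              # a list of integer token IDs
print(enc.decode(ids))  # decodes back to the original string
```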
What this is NOT
- Not word splitting (tokenizers often split within words)
- Not the model itself (tokenizer is a preprocessing component)
- Not character-level encoding (modern LLM tokenizers typically operate on subword units)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The text processing component that splits text into subword units (tokens) and maps them to integer IDs. Examples: tiktoken (OpenAI), SentencePiece (Google), Hugging Face tokenizers.
Sources: tiktoken documentation, SentencePiece paper, BPE algorithm papers
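For instance, a sketch using the Hugging Face `transformers` API, with the publicly available `gpt2` checkpoint as an illustrative choice:

```python
# Subword tokenization via Hugging Face AutoTokenizer
# (assumes `pip install transformers`; "gpt2" is just an example checkpoint).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode("unbelievable")         # integer IDs for the subword pieces
pieces = tok.convert_ids_to_tokens(ids)  # the subword strings those IDs stand for
print(ids, pieces)                       # typically more than one piece for a rare word
print(tok.decode(ids))                   # reconstructs "unbelievable"
```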
Examples
- tiktoken (OpenAI's tokenizer for GPT models)
- SentencePiece (used by Llama, T5)
- Hugging Face AutoTokenizer
- cl100k_base encoding for GPT-4
Counterexamples
Things that might seem like a Tokenizer but are not (contrasted in the sketch after this list):
- Word-level tokenization (too coarse; a fixed word vocabulary cannot cover unseen or rare words)
- Character-level encoding (too fine; sequences become long and inefficient to process)
- The LLM's weights (the tokenizer is a separate component)
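A small contrast sketch (assuming `tiktoken` is installed) showing why subword tokenization sits between the two extremes: word splitting gives the fewest, coarsest segments, character encoding the longest sequences, and subword tokenization a middle ground.

```python
# Compare segment counts for the same text (assumes `pip install tiktoken`).
import tiktoken

text = "Tokenization handles unbelievable out-of-vocabulary words."

words = text.split()                                              # word-level: coarse
chars = list(text)                                                # character-level: very long
subword_ids = tiktoken.get_encoding("cl100k_base").encode(text)  # subword: middle ground

print(len(words), len(chars), len(subword_ids))
```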
Relations
- requiredBy large-language-model (LLMs need tokenizers to process text)
- produces token (Tokenizers produce tokens)
- overlapsWith context-window (Context is measured in tokens)
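Because context limits are stated in tokens, prompt length has to be measured with the model's own tokenizer. A sketch, assuming `tiktoken` and a hypothetical 8,192-token window:

```python
# Check a prompt against a (hypothetical) context-window limit in tokens.
import tiktoken

CONTEXT_WINDOW = 8192  # illustrative limit; real limits vary by model
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the following document: ..."  # placeholder prompt
n_tokens = len(enc.encode(prompt))

if n_tokens > CONTEXT_WINDOW:
    print(f"Prompt is {n_tokens} tokens and exceeds the {CONTEXT_WINDOW}-token window.")
else:
    print(f"Prompt fits: {n_tokens} of {CONTEXT_WINDOW} tokens.")
```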