Tokenizer


Also known as: Tokenization, Token Encoder

Definition

A component that converts raw text into tokens (numerical IDs) that an LLM can process, and converts tokens back to text. Tokenizers define the vocabulary of a model and how text is segmented. The tokenization scheme affects model behavior, efficiency, and multilingual capabilities.
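
A minimal round-trip sketch of what this component does, using tiktoken with the cl100k_base encoding (both listed under Examples below); it assumes tiktoken is installed, and the sample text and printed values are illustrative rather than taken from this entry.

```python
import tiktoken

# Load the cl100k_base encoding (used by GPT-4-era models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers map text to integer IDs."
token_ids = enc.encode(text)      # text -> list of integer token IDs
decoded = enc.decode(token_ids)   # token IDs -> text

print(token_ids)                  # the integer IDs; exact values depend on the vocabulary
assert decoded == text            # encoding then decoding recovers the original text
print(enc.n_vocab)                # size of the vocabulary this tokenizer defines
```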

What this is NOT

  • Not word splitting (tokenizers often split within words; see the sketch after this list)
  • Not the model itself (tokenizer is a preprocessing component)
  • Not character-level encoding (tokenizers use subword units)
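
A small sketch of the first point, again assuming tiktoken is available: decoding each token ID of a single word back to its surface string shows the split happening inside the word. The exact pieces depend on the learned vocabulary, so the commented output is only indicative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "tokenization"
ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]  # decode each token ID individually

# A single word typically becomes more than one subword token, and none of the
# pieces need to be whole words or single characters.
print(ids)     # several IDs: not one per word, not one per character
print(pieces)  # e.g. ['token', 'ization'] -- actual split is vocabulary-dependent
```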

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

The text processing component that splits text into subword units (tokens) and maps them to integer IDs. Examples: tiktoken (OpenAI), SentencePiece (Google), Hugging Face tokenizers.

Sources: tiktoken documentation, SentencePiece paper, BPE algorithm papers
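
A hedged sketch of the same text-to-ID mapping through the Hugging Face tokenizers stack named above, via transformers.AutoTokenizer; it assumes the transformers package is installed and that the GPT-2 tokenizer files can be downloaded, and the printed subword pieces depend on that model's vocabulary.

```python
from transformers import AutoTokenizer

# BPE tokenizer shipped with GPT-2, loaded through the generic AutoTokenizer entry point.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Subword units map to integer IDs."
tokens = tokenizer.tokenize(text)              # surface subword pieces
ids = tokenizer.convert_tokens_to_ids(tokens)  # pieces -> integer IDs

print(tokens)                 # subword strings (vocabulary-dependent)
print(ids)                    # the integer IDs the model actually consumes
print(tokenizer.decode(ids))  # IDs -> text again
```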

Examples

  • tiktoken (OpenAI's tokenizer for GPT models)
  • SentencePiece (used by Llama, T5)
  • Hugging Face AutoTokenizer
  • cl100k_base encoding for GPT-4

Counterexamples

Things that might seem like a tokenizer but are not:

  • Word-level tokenization (too coarse)
  • Character-level encoding (too fine, inefficient; see the comparison sketch after this list)
  • The LLM model weights (tokenizer is separate)
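
To make the contrast concrete, a rough comparison of the three granularities, using tiktoken's cl100k_base encoding as the subword tokenizer; the sentence and the resulting counts are illustrative only.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Internationalization is notoriously hard to tokenize consistently."

word_units = text.split()       # word-level: coarse, cannot represent unseen words
char_units = list(text)         # character-level: fine-grained but very long sequences
subword_ids = enc.encode(text)  # subword: a middle ground learned from data

print(len(word_units), "word units")
print(len(char_units), "character units")
print(len(subword_ids), "subword tokens")  # typically between the two counts above
```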

Relations