Tokenizer


Also known as: Tokenization, Token Encoder

Definition

A component that converts raw text into tokens (numerical IDs) that an LLM can process, and converts tokens back to text. Tokenizers define the vocabulary of a model and how text is segmented. The tokenization scheme affects model behavior, efficiency, and multilingual capabilities.
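
A minimal round-trip sketch of what this component does, using tiktoken with the cl100k_base encoding (both listed under Examples below); it assumes tiktoken is installed, and the sample text and printed values are illustrative rather than taken from this entry.

```python
import tiktoken

# Load the cl100k_base encoding (used by GPT-4-era models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers map text to integer IDs."
token_ids = enc.encode(text)      # text -> list of integer token IDs
decoded = enc.decode(token_ids)   # token IDs -> text

print(token_ids)                  # the integer IDs; exact values depend on the vocabulary
assert decoded == text            # encoding then decoding recovers the original text
print(enc.n_vocab)                # size of the vocabulary this tokenizer defines
```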

What this is NOT

  • Not word splitting (tokenizers often split within words; see the sketch after this list)
  • Not the model itself (tokenizer is a preprocessing component)
  • Not character-level encoding (tokenizers use subword units)
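
A small sketch of the first point, again assuming tiktoken is available: decoding each token ID of a single word back to its surface string shows the split happening inside the word. The exact pieces depend on the learned vocabulary, so the commented output is only indicative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "tokenization"
ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]  # decode each token ID individually

# A single word typically becomes more than one subword token, and none of the
# pieces need to be whole words or single characters.
print(ids)     # several IDs: not one per word, not one per character
print(pieces)  # e.g. ['token', 'ization'] -- actual split is vocabulary-dependent
```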

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

The text processing component that splits text into subword units (tokens) and maps them to integer IDs. Examples: tiktoken (OpenAI), SentencePiece (Google), Hugging Face tokenizers.

Sources: tiktoken documentation, SentencePiece paper, BPE algorithm papers
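
A hedged sketch of the same text-to-ID mapping through the Hugging Face tokenizers stack named above, via transformers.AutoTokenizer; it assumes the transformers package is installed and that the GPT-2 tokenizer files can be downloaded, and the printed subword pieces depend on that model's vocabulary.

```python
from transformers import AutoTokenizer

# BPE tokenizer shipped with GPT-2, loaded through the generic AutoTokenizer entry point.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Subword units map to integer IDs."
tokens = tokenizer.tokenize(text)              # surface subword pieces
ids = tokenizer.convert_tokens_to_ids(tokens)  # pieces -> integer IDs

print(tokens)                 # subword strings (vocabulary-dependent)
print(ids)                    # the integer IDs the model actually consumes
print(tokenizer.decode(ids))  # IDs -> text again
```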

Examples

  • tiktoken (OpenAI's tokenizer for GPT models)
  • SentencePiece (used by Llama, T5)
  • Hugging Face AutoTokenizer
  • cl100k_base encoding for GPT-4

Counterexamples

Things that might seem like a tokenizer but are not:

  • Word-level tokenization (too coarse)
  • Character-level encoding (too fine, inefficient; see the comparison sketch after this list)
  • The LLM model weights (tokenizer is separate)
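
To make the contrast concrete, a rough comparison of the three granularities, using tiktoken's cl100k_base encoding as the subword tokenizer; the sentence and the resulting counts are illustrative only.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Internationalization is notoriously hard to tokenize consistently."

word_units = text.split()       # word-level: coarse, cannot represent unseen words
char_units = list(text)         # character-level: fine-grained but very long sequences
subword_ids = enc.encode(text)  # subword: a middle ground learned from data

print(len(word_units), "word units")
print(len(char_units), "character units")
print(len(subword_ids), "subword tokens")  # typically between the two counts above
```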

Relations