Token
Also known as: Subword, Token ID
Definition
The basic unit of text that an LLM processes, typically a subword piece rather than a full word or character. Tokens are what models actually see—text is converted to tokens for input and tokens are converted back to text for output. Token counts determine context window usage, API costs, and processing time.
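As a minimal sketch of the text-to-token round trip, assuming the tiktoken library and its cl100k_base encoding (the one used by GPT-4-era OpenAI models):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding, ~100K vocabulary

text = "Tokenization splits text into subword pieces."
ids = enc.encode(text)                   # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]  # decode each ID to see its text piece

print(len(ids), "tokens for", len(text), "characters")
print(pieces)                   # subword pieces, not whole words
assert enc.decode(ids) == text  # encoding/decoding is a lossless round trip
```

The count returned by enc.encode is exactly what context-window limits and per-token pricing are measured against.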
What this is NOT
- Not always a word (tokens are often subwords)
- Not a character (tokens are typically larger than single characters)
- Not directly visible to users (tokens are an internal representation)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
A piece of text (often 3-4 characters on average in English) that maps to an integer ID in the model's vocabulary. A word might be one token or several; punctuation and spaces can be separate tokens.
Sources: OpenAI tokenizer documentation, Hugging Face tokenizer documentation
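A hedged sketch of this piece-to-ID mapping using the Hugging Face transformers library; the GPT-2 tokenizer is chosen here only because it is a small, public example:

```python
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # GPT-2's byte-level BPE tokenizer

pieces = tok.tokenize("Tokenization matters.")  # text -> subword strings
ids = tok.convert_tokens_to_ids(pieces)         # strings -> vocabulary IDs

print(list(zip(pieces, ids)))  # GPT-2 marks a leading space with 'Ġ'
print(tok.vocab_size)          # 50257 vocabulary entries for GPT-2
```

Note how a word like 'Tokenization' typically splits into multiple pieces, while the space before 'matters' is folded into the following token; token boundaries rarely line up with word boundaries.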
Examples
- The word 'tokenization' might be ['token', 'ization']
- GPT-4 has a vocabulary of ~100K tokens
- A 128K context window = ~128,000 tokens (roughly 96,000 English words at ~0.75 words per token)
- API pricing: $0.01 per 1K input tokens
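To make the last two figures concrete, a small sketch using the illustrative numbers above (not current prices):

```python
# Illustrative figures from the examples above -- not real current pricing
PRICE_PER_1K_INPUT_TOKENS = 0.01  # dollars per 1,000 input tokens
CONTEXT_WINDOW = 128_000          # tokens

def input_cost(n_tokens: int) -> float:
    """Dollar cost of sending n_tokens of input at the example rate."""
    return n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

def fits_in_context(n_tokens: int) -> bool:
    """Whether a prompt of n_tokens fits in the example context window."""
    return n_tokens <= CONTEXT_WINDOW

prompt = 50_000  # tokens
print(f"${input_cost(prompt):.2f}")  # $0.50
print(fits_in_context(prompt))       # True
```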
Counterexamples
Things that might seem like tokens but are not:
- Full words (tokens are often subwords)
- Characters (tokens group multiple characters)
- Bytes (though some tokenizers are byte-level)
Relations
- requires tokenizer (Tokens are produced by tokenizers)
- overlapsWith context-window (Context is measured in tokens)
- overlapsWith inference (Inference consumes input tokens and generates output tokens)