Token
Also known as: Subword, Token ID
Definition
The basic unit of text that an LLM processes, typically a subword piece rather than a full word or character. Tokens are what models actually see—text is converted to tokens for input and tokens are converted back to text for output. Token counts determine context window usage, API costs, and processing time.
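As a minimal sketch of the text-to-token round trip, assuming the tiktoken library and its cl100k_base encoding (the one used by GPT-4-era OpenAI models):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding, ~100K vocabulary

text = "Tokenization splits text into subword pieces."
ids = enc.encode(text)                   # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]  # decode each ID to see its text piece

print(len(ids), "tokens for", len(text), "characters")
print(pieces)                   # subword pieces, not whole words
assert enc.decode(ids) == text  # encoding/decoding is a lossless round trip
```

The count returned by enc.encode is exactly what context-window limits and per-token pricing are measured against.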
What this is NOT
- Not always a word (tokens are often subwords)
- Not a character (tokens are typically larger than single characters)
- Not directly visible to users (tokens are an internal representation)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
A piece of text (often 3-4 characters on average in English) that maps to an integer ID in the model's vocabulary. A word might be one token or several; punctuation and spaces can be separate tokens.
Sources: OpenAI tokenizer documentation, Hugging Face tokenizer documentation
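A hedged sketch of this piece-to-ID mapping using the Hugging Face transformers library; the GPT-2 tokenizer is chosen here only because it is a small, public example:

```python
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # GPT-2's byte-level BPE tokenizer

pieces = tok.tokenize("Tokenization matters.")  # text -> subword strings
ids = tok.convert_tokens_to_ids(pieces)         # strings -> vocabulary IDs

print(list(zip(pieces, ids)))  # GPT-2 marks a leading space with 'Ġ'
print(tok.vocab_size)          # 50257 vocabulary entries for GPT-2
```

Note how a word like 'Tokenization' typically splits into multiple pieces, while the space before 'matters' is folded into the following token; token boundaries rarely line up with word boundaries.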
Examples
- The word 'tokenization' might be ['token', 'ization']
- GPT-4 has a vocabulary of ~100K tokens
- A 128K context window = ~128,000 tokens (roughly 96,000 English words at ~0.75 words per token)
- API pricing: $0.01 per 1K input tokens
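To make the last two figures concrete, a small sketch using the illustrative numbers above (not current prices):

```python
# Illustrative figures from the examples above -- not real current pricing
PRICE_PER_1K_INPUT_TOKENS = 0.01  # dollars per 1,000 input tokens
CONTEXT_WINDOW = 128_000          # tokens

def input_cost(n_tokens: int) -> float:
    """Dollar cost of sending n_tokens of input at the example rate."""
    return n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

def fits_in_context(n_tokens: int) -> bool:
    """Whether a prompt of n_tokens fits in the example context window."""
    return n_tokens <= CONTEXT_WINDOW

prompt = 50_000  # tokens
print(f"${input_cost(prompt):.2f}")  # $0.50
print(fits_in_context(prompt))       # True
```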
Counterexamples
Things that might seem like tokens but are not:
- Full words (tokens are often subwords)
- Characters (tokens group multiple characters)
- Bytes (though some tokenizers are byte-level)
Relations
- requires tokenizer (Tokens are produced by tokenizers)
- overlapsWith context-window (Context is measured in tokens)
- overlapsWith inference (Inference consumes input tokens and generates output tokens)