Transformer
Also known as: Transformer Architecture, Attention Architecture
Definition
A neural network architecture based on self-attention mechanisms, introduced in "Attention Is All You Need" (Vaswani et al., 2017). Transformers process sequences by letting each position attend to every other position, which enables parallel processing of the whole sequence and direct modeling of long-range dependencies. They are the architecture underlying virtually all modern LLMs, having replaced earlier RNN-based approaches.
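At the core of the architecture is scaled dot-product self-attention: each position forms a query that is compared against the keys of every position, and the resulting weights mix the value vectors. Below is a minimal single-head sketch in NumPy; the function and variable names are illustrative, not taken from any particular library.

```python
# Minimal sketch of single-head scaled dot-product self-attention (NumPy).
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v              # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every position scores every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                          # toy sequence: 5 tokens, d_model = 16
W = [rng.normal(size=(16, 16)) * 0.1 for _ in range(3)]
out = self_attention(x, *W)                           # (5, 16): one output vector per position
```

Because every position attends to all others in a single matrix operation, the whole sequence is processed in parallel rather than token by token as in an RNN.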
What this is NOT
- Not RNNs or LSTMs (transformers replaced these)
- Not a specific model (transformer is an architecture; GPT is a model)
- Not convolutional networks (different operation, though sometimes combined)
Alternative Interpretations
Different communities use this term differently:
ml-research
The architecture described in Vaswani et al. (2017), consisting of self-attention layers, feed-forward networks, residual connections, and layer normalization. Originally for machine translation, now the dominant architecture for language, vision, and multimodal AI.
Sources: Attention Is All You Need (Vaswani et al., 2017), Transformer architecture surveys
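A rough sketch of one such block, combining the four ingredients listed above (shown here in the pre-norm arrangement common in modern models; the original paper applied layer normalization after each residual sum). Parameter names and sizes are assumptions for illustration, not a reference implementation.

```python
# Illustrative pre-norm transformer block: self-attention and a position-wise
# feed-forward network, each wrapped in a residual connection and layer norm.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def transformer_block(x, p):
    # Self-attention sub-layer with residual connection
    h = layer_norm(x)
    Q, K, V = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    x = x + attn @ p["Wo"]
    # Feed-forward sub-layer with residual connection (ReLU, as in the original paper)
    h = layer_norm(x)
    x = x + np.maximum(h @ p["W1"], 0) @ p["W2"]
    return x

d, d_ff = 16, 64
rng = np.random.default_rng(0)
p = {k: rng.normal(size=s) * 0.1 for k, s in
     [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)), ("Wo", (d, d)),
      ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
y = transformer_block(rng.normal(size=(5, d)), p)     # output keeps shape (5, 16)
```

A full model stacks many such blocks and adds token embeddings, positional information, and a task-specific head.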
Examples
- GPT architecture (decoder-only transformer; see the masking sketch after this list)
- BERT (encoder-only transformer)
- T5 (encoder-decoder transformer)
- Vision Transformer (ViT) for images
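The examples above differ mainly in which positions may attend to which: encoder-only models like BERT use full bidirectional attention, while decoder-only models like GPT apply a causal mask so each token sees only itself and earlier tokens. A toy illustration of that difference (the scores are dummy values chosen only for the example):

```python
# Sketch of the masking difference between encoder-only (BERT-style) and
# decoder-only (GPT-style) attention; the layers are otherwise the same.
import numpy as np

def attention_weights(scores, causal):
    if causal:
        # Block attention to future positions with an upper-triangular mask
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

scores = np.zeros((4, 4))                        # toy uniform attention scores
print(attention_weights(scores, causal=False))   # encoder-only: full bidirectional attention
print(attention_weights(scores, causal=True))    # decoder-only: lower-triangular attention
```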
Counterexamples
Things that might seem like Transformer but are not:
- LSTM networks (sequential, not attention-based)
- Convolutional neural networks (different operation)
- State space models like Mamba (emerging alternative to transformers)
Relations
- requiredBy large-language-model (modern LLMs are built on the transformer architecture)
- overlapsWith inference (inference is a forward pass through the transformer's layers)
- overlapsWith context-window (the attention mechanism spans the context window, which bounds how many tokens each position can attend to; see the sketch after this list)
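The context-window relation also explains a practical limit: each layer builds an attention score matrix over all pairs of positions, so compute and memory grow quadratically with the window length. A purely illustrative back-of-the-envelope sketch (the head count is an assumption for the example):

```python
# Score-matrix entries per layer grow quadratically with context length.
def attention_scores_per_layer(context_length: int, num_heads: int = 12) -> int:
    return num_heads * context_length ** 2    # one score per (query, key) pair per head

for n in (1_024, 8_192, 32_768):
    print(n, attention_scores_per_layer(n))   # 8x longer window -> 64x more scores
```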