Transformer

Also known as: Transformer Architecture, Attention Architecture

Definition

A neural network architecture based on self-attention mechanisms, introduced in "Attention Is All You Need" (2017). Transformers process sequences by allowing each position to attend to all other positions, enabling parallel processing and capturing long-range dependencies. They are the architecture underlying virtually all modern LLMs, replacing earlier RNN-based approaches.
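
To make "each position attends to all other positions" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The function and parameter names (self_attention, w_q, w_k, w_v) and the toy shapes are illustrative assumptions, not any particular library's API.

  # Minimal sketch of scaled dot-product self-attention (illustrative only;
  # names and shapes are assumptions, not taken from a specific library).
  import numpy as np

  def self_attention(x, w_q, w_k, w_v):
      """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
      q, k, v = x @ w_q, x @ w_k, x @ w_v        # project inputs to queries, keys, values
      scores = q @ k.T / np.sqrt(k.shape[-1])    # every position scores every other position
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
      return weights @ v                          # weighted sum of value vectors

  # Toy usage: 4 tokens, model width 8, single head of width 8
  rng = np.random.default_rng(0)
  x = rng.normal(size=(4, 8))
  out = self_attention(x, *(rng.normal(size=(8, 8)) for _ in range(3)))
  print(out.shape)  # (4, 8)

Because the attention weights for all positions are computed with a single matrix product, the whole sequence can be processed in parallel, which is the key contrast with step-by-step RNN processing.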

What this is NOT

  • Not RNNs or LSTMs (transformers replaced these)
  • Not a specific model (transformer is an architecture; GPT is a model)
  • Not convolutional networks (convolution is a different operation, though the two are sometimes combined)

Alternative Interpretations

Different communities use this term differently:

ml-research

The architecture described in Vaswani et al. (2017), consisting of self-attention layers, feed-forward networks, residual connections, and layer normalization. Originally for machine translation, now the dominant architecture for language, vision, and multimodal AI.

Sources: Attention Is All You Need (Vaswani et al., 2017), Transformer architecture surveys
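
A rough sketch of how those components fit together in one transformer block: a self-attention sublayer and a position-wise feed-forward sublayer, each wrapped in a residual connection with layer normalization. This assumes the common pre-norm layout (the original paper applies normalization after each sublayer); all names and shapes are illustrative, not a specific library's API.

  # Sketch of a single pre-norm transformer block (illustrative assumptions,
  # not the exact layout of any published implementation).
  import numpy as np

  def layer_norm(x, eps=1e-5):
      mu = x.mean(axis=-1, keepdims=True)
      var = x.var(axis=-1, keepdims=True)
      return (x - mu) / np.sqrt(var + eps)

  def softmax(s):
      e = np.exp(s - s.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def transformer_block(x, p):
      # Self-attention sublayer with residual connection
      h = layer_norm(x)
      q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
      attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
      x = x + attn @ p["wo"]
      # Position-wise feed-forward sublayer with residual connection
      h = layer_norm(x)
      x = x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]   # ReLU feed-forward network
      return x

  d, ff = 8, 32
  rng = np.random.default_rng(0)
  shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
            "w1": (d, ff), "w2": (ff, d)}
  params = {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
  print(transformer_block(rng.normal(size=(4, d)), params).shape)  # (4, 8)

A full model stacks many such blocks; multi-head attention and positional information are omitted here to keep the sketch short.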

Examples

  • GPT architecture (decoder-only transformer; see the masking sketch after this list)
  • BERT (encoder-only transformer)
  • T5 (encoder-decoder transformer)
  • Vision Transformer (ViT) for images
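
In attention terms, the decoder-only and encoder-only variants above differ mainly in the attention mask: decoder-only models apply a causal mask so each token attends only to itself and earlier tokens, while encoder-only models attend bidirectionally. A minimal illustrative sketch, with assumed names:

  # Causal vs. full attention masks (illustrative only).
  import numpy as np

  def attention_weights(q, k, causal=False):
      scores = q @ k.T / np.sqrt(k.shape[-1])
      if causal:
          # Decoder-only (GPT-style): block attention to future positions
          mask = np.triu(np.ones(scores.shape, dtype=bool), 1)
          scores = np.where(mask, -np.inf, scores)
      e = np.exp(scores - scores.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  rng = np.random.default_rng(0)
  q = k = rng.normal(size=(4, 8))
  print(np.round(attention_weights(q, k, causal=True), 2))   # upper triangle is 0
  print(np.round(attention_weights(q, k, causal=False), 2))  # every position attends to all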

Counterexamples

Things that might seem like Transformer but are not:

  • LSTM networks (sequential, not attention-based)
  • Convolutional neural networks (built on local convolutions rather than global attention)
  • State space models like Mamba (emerging alternative to transformers)

Relations

  • requiredBy large-language-model (Modern LLMs are built on the transformer architecture)
  • overlapsWith inference (Inference is a forward pass through the transformer's layers)
  • overlapsWith context-window (The maximum sequence length the attention layers can process sets the context window)