Transformer

Also known as: Transformer Architecture, Attention Architecture

Definition

A neural network architecture based on self-attention mechanisms, introduced in "Attention Is All You Need" (2017). Transformers process sequences by allowing each position to attend to all other positions, enabling parallel processing and capturing long-range dependencies. They are the architecture underlying virtually all modern LLMs, replacing earlier RNN-based approaches.
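
To make "each position attends to all other positions" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The function and parameter names (self_attention, w_q, w_k, w_v) and the toy shapes are illustrative assumptions, not any particular library's API.

  # Minimal sketch of scaled dot-product self-attention (illustrative only;
  # names and shapes are assumptions, not taken from a specific library).
  import numpy as np

  def self_attention(x, w_q, w_k, w_v):
      """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
      q, k, v = x @ w_q, x @ w_k, x @ w_v        # project inputs to queries, keys, values
      scores = q @ k.T / np.sqrt(k.shape[-1])    # every position scores every other position
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
      return weights @ v                          # weighted sum of value vectors

  # Toy usage: 4 tokens, model width 8, single head of width 8
  rng = np.random.default_rng(0)
  x = rng.normal(size=(4, 8))
  out = self_attention(x, *(rng.normal(size=(8, 8)) for _ in range(3)))
  print(out.shape)  # (4, 8)

Because the attention weights for all positions are computed with a single matrix product, the whole sequence can be processed in parallel, which is the key contrast with step-by-step RNN processing.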

What this is NOT

  • Not RNNs or LSTMs (transformers replaced these)
  • Not a specific model (transformer is an architecture; GPT is a model)
  • Not convolutional networks (convolution is a different operation, though the two are sometimes combined)

Alternative Interpretations

Different communities use this term differently:

ml-research

The architecture described in Vaswani et al. (2017), consisting of self-attention layers, feed-forward networks, residual connections, and layer normalization. Originally for machine translation, now the dominant architecture for language, vision, and multimodal AI.

Sources: Attention Is All You Need (Vaswani et al., 2017), Transformer architecture surveys
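
A rough sketch of how those components fit together in one transformer block: a self-attention sublayer and a position-wise feed-forward sublayer, each wrapped in a residual connection with layer normalization. This assumes the common pre-norm layout (the original paper applies normalization after each sublayer); all names and shapes are illustrative, not a specific library's API.

  # Sketch of a single pre-norm transformer block (illustrative assumptions,
  # not the exact layout of any published implementation).
  import numpy as np

  def layer_norm(x, eps=1e-5):
      mu = x.mean(axis=-1, keepdims=True)
      var = x.var(axis=-1, keepdims=True)
      return (x - mu) / np.sqrt(var + eps)

  def softmax(s):
      e = np.exp(s - s.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def transformer_block(x, p):
      # Self-attention sublayer with residual connection
      h = layer_norm(x)
      q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
      attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
      x = x + attn @ p["wo"]
      # Position-wise feed-forward sublayer with residual connection
      h = layer_norm(x)
      x = x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]   # ReLU feed-forward network
      return x

  d, ff = 8, 32
  rng = np.random.default_rng(0)
  shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
            "w1": (d, ff), "w2": (ff, d)}
  params = {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
  print(transformer_block(rng.normal(size=(4, d)), params).shape)  # (4, 8)

A full model stacks many such blocks; multi-head attention and positional information are omitted here to keep the sketch short.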

Examples

  • GPT architecture (decoder-only transformer; see the masking sketch after this list)
  • BERT (encoder-only transformer)
  • T5 (encoder-decoder transformer)
  • Vision Transformer (ViT) for images
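
In attention terms, the decoder-only and encoder-only variants above differ mainly in the attention mask: decoder-only models apply a causal mask so each token attends only to itself and earlier tokens, while encoder-only models attend bidirectionally. A minimal illustrative sketch, with assumed names:

  # Causal vs. full attention masks (illustrative only).
  import numpy as np

  def attention_weights(q, k, causal=False):
      scores = q @ k.T / np.sqrt(k.shape[-1])
      if causal:
          # Decoder-only (GPT-style): block attention to future positions
          mask = np.triu(np.ones(scores.shape, dtype=bool), 1)
          scores = np.where(mask, -np.inf, scores)
      e = np.exp(scores - scores.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  rng = np.random.default_rng(0)
  q = k = rng.normal(size=(4, 8))
  print(np.round(attention_weights(q, k, causal=True), 2))   # upper triangle is 0
  print(np.round(attention_weights(q, k, causal=False), 2))  # every position attends to all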

Counterexamples

Things that might seem like Transformer but are not:

  • LSTM networks (sequential, not attention-based)
  • Convolutional neural networks (built on local convolutions rather than global attention)
  • State space models like Mamba (emerging alternative to transformers)

Relations

  • requiredBy large-language-model (Modern LLMs are built on the transformer architecture)
  • overlapsWith inference (Inference is a forward pass through the transformer's layers)
  • overlapsWith context-window (The maximum sequence length the attention layers can process sets the context window)