Inference


Also known as: Model Inference, Prediction, Forward Pass

Definition

Running a trained model on input data to produce output (predictions, generated text, classifications). Inference is the "using" phase of machine learning, as opposed to training. For LLMs, inference means processing a prompt through the model to generate a response, one token at a time.
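As a rough illustration, the sketch below mimics that token-by-token loop with a toy stand-in for a trained model; the `VOCAB`, `forward`, and `generate` names are illustrative, not part of any specific library.

```python
import numpy as np

# Minimal sketch of autoregressive inference. The random weight matrix is a
# stand-in for trained weights; a real LLM runs a transformer forward pass.
rng = np.random.default_rng(0)
VOCAB = ["<eos>", "hello", "world", "how", "are", "you"]
W = rng.normal(size=(len(VOCAB), len(VOCAB)))  # pretend these are trained weights

def forward(token_ids: list[int]) -> np.ndarray:
    """One forward pass: map the current context to next-token logits."""
    return W[token_ids[-1]]  # toy model: logits depend only on the last token

def generate(prompt_ids: list[int], max_new_tokens: int = 5) -> list[int]:
    """Generate a response one token at a time; weights are never updated."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)              # forward pass
        next_id = int(np.argmax(logits))   # greedy decoding
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return ids

print([VOCAB[i] for i in generate([VOCAB.index("hello")])])
```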

What this is NOT

  • Not training (inference doesn't update weights)
  • Not fine-tuning (that involves training)
  • Not the model architecture (inference is the execution)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

The process of sending a prompt to an LLM and receiving a completion. Inference happens at API endpoints or when running models locally. Characterized by latency (time to first token, time to completion) and throughput (tokens per second).

Sources: Model serving documentation, LLM inference optimization papers
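A minimal sketch of those two metrics, assuming a generator that yields tokens as they arrive; the `fake_stream` function below is a stand-in for a real API or local-model stream.

```python
import time
from typing import Iterable

def measure_inference(stream: Iterable[str]) -> dict:
    """Measure time-to-first-token (TTFT) and throughput for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start   # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": ttft,
        "tokens_per_second": count / total if total > 0 else float("inf"),
        "total_tokens": count,
    }

def fake_stream(n: int = 20, delay: float = 0.05):
    """Stand-in for a streaming endpoint that yields tokens as generated."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(measure_inference(fake_stream()))
```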

ml-engineering

Executing a forward pass through a neural network to produce output from input, using trained weights without updating them.

Sources: ML systems literature
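A minimal sketch of that definition in PyTorch; the tiny untrained network is an assumption standing in for any model with trained weights.

```python
import torch
import torch.nn as nn

# Inference = forward pass with fixed weights, no gradient updates.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
model.eval()                      # inference mode: disables dropout etc.

x = torch.randn(4, 8)             # a batch of 4 inputs
with torch.no_grad():             # no gradients tracked, weights never change
    logits = model(x)             # the forward pass
    predictions = logits.argmax(dim=-1)

print(predictions)
```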

Examples

  • Sending a chat message to the GPT-4 API and receiving a response
  • Running Llama locally with llama.cpp
  • Batch inference processing 1,000 prompts (see the sketch after this list)
  • Streaming inference, where tokens arrive as they are generated
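A minimal batch-inference sketch, assuming a hypothetical `run_inference` function standing in for an API call or a local model; real serving systems additionally batch requests at the GPU level for throughput.

```python
from concurrent.futures import ThreadPoolExecutor

def run_inference(prompt: str) -> str:
    """Placeholder for a real model call (API request or local forward pass)."""
    return f"response to: {prompt}"

prompts = [f"prompt {i}" for i in range(1000)]

# In batch workloads, overall throughput matters more than per-request latency,
# so independent prompts are processed concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(run_inference, prompts))

print(len(responses), responses[0])
```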

Counterexamples

Things that might seem like Inference but are not:

  • Training a model on data (that updates weights; inference does not)
  • Fine-tuning on custom data (that is also a form of training)
  • Model architecture design (that happens before any inference)

Relations

  • requires large-language-model (Inference runs LLMs)
  • requires token (Inference processes tokens)
  • inTensionWith fine-tuning (Inference uses trained models; fine-tuning trains them)
  • overlapsWith model-serving (Model serving handles inference requests)

Implementations

Tools and frameworks that implement this concept: