Inference
Also known as: Model Inference, Prediction, Forward Pass
Definition
Running a trained model on input data to produce output (predictions, generated text, classifications). Inference is the "using" phase of machine learning, as opposed to training. For LLMs, inference means processing a prompt through the model to generate a response autoregressively, one token at a time, with each new token conditioned on the prompt and all previously generated tokens.
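As a minimal sketch of that token-by-token loop, assuming the Hugging Face transformers and PyTorch packages and the small GPT-2 checkpoint (any causal language model would work the same way), greedy decoding is just repeated forward passes:

```python
# Minimal greedy decoding loop: one forward pass per generated token.
# Assumes `torch` and `transformers` are installed and can download "gpt2".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: no dropout, no weight updates

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():                               # never compute gradients
    for _ in range(10):                             # generate 10 tokens
        logits = model(input_ids).logits            # shape: [1, seq_len, vocab]
        next_id = logits[0, -1].argmax()            # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

In practice, model.generate() or a serving engine wraps this loop with batching, sampling, and KV caching, but the core operation is the same forward pass repeated per token.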
What this is NOT
- Not training (inference doesn't update weights)
- Not fine-tuning (that involves training)
- Not the model architecture itself (inference is the execution of a model, not its design)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The process of sending a prompt to an LLM and receiving a completion. Inference happens at API endpoints or when running models locally. Characterized by latency (time to first token, time to completion) and throughput (tokens per second).
Sources: Model serving documentation, LLM inference optimization papers
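A hedged sketch of measuring those two characteristics against an OpenAI-compatible chat endpoint, assuming the `openai` Python package; the model name is illustrative, and counting stream chunks is only a rough proxy for tokens:

```python
# Rough latency/throughput measurement over a streaming chat completion.
# Assumes OPENAI_API_KEY is set; "gpt-4o-mini" is an illustrative model name.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain inference in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()    # time to first token (TTFT)
        n_chunks += 1                               # roughly one token per chunk

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s  total: {elapsed:.2f}s  "
      f"throughput: ~{n_chunks / elapsed:.1f} tok/s")
```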
ml-engineering
Executing a forward pass through a neural network to produce output from input, using trained weights without updating them.
Sources: ML systems literature
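A minimal sketch of this view, assuming PyTorch; the tiny, randomly initialized network and input stand in for a trained model and real data:

```python
# One forward pass through a network with gradient tracking disabled:
# the weights are read, never updated.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
model.eval()                       # e.g. disables dropout at inference time

x = torch.randn(1, 16)             # a single input example
with torch.no_grad():              # forward pass only; no gradients, no training
    logits = model(x)
    prediction = logits.argmax(dim=-1)

print(prediction.item())           # predicted class index
```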
Examples
- Sending a chat message to GPT-4 API and receiving a response
- Running Llama locally with llama.cpp
- Batch inference that processes 1,000 prompts in a single job (see the sketch after this list)
- Streaming inference, where tokens are returned to the client as they are generated
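For the batch case, a hedged sketch using vLLM (listed under Implementations below); it assumes a GPU machine with the `vllm` package installed, and the checkpoint name is illustrative:

```python
# Offline batch inference: many prompts submitted in one call so the engine
# can batch them on the GPU. Assumes `vllm` is installed and a GPU is available.
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i} in one sentence." for i in range(1000)]
params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # illustrative checkpoint
outputs = llm.generate(prompts, params)                # runs the whole batch

for out in outputs[:3]:                                # inspect a few results
    print(out.outputs[0].text)
```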
Counterexamples
Things that might seem like Inference but are not:
- Training a model on data (that's training)
- Fine-tuning on custom data (that's training)
- Model architecture design (that's before inference)
Relations
- requires large-language-model (Inference executes a trained LLM)
- requires token (Inference consumes prompt tokens and produces output tokens)
- inTensionWith fine-tuning (Inference uses trained models; fine-tuning trains them)
- overlapsWith model-serving (Model serving handles inference requests)
Implementations
Tools and frameworks that implement this concept (a minimal local usage example follows the list):
- llama.cpp primary
- LM Studio primary
- Ollama primary
- Text Generation Inference primary
- vLLM primary
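As a concrete local example with one of the tools above, a hedged sketch against Ollama's HTTP API; it assumes the Ollama server is running on its default port with a model already pulled (the model name is illustrative), and uses only the Python standard library:

```python
# Send one prompt to a locally running Ollama server and print the completion.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
import json
import urllib.request

payload = {"model": "llama3", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["response"])   # the generated text
```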