Inference
Also known as: Model Inference, Prediction, Forward Pass
Definition
Running a trained model on input data to produce output (predictions, generated text, classifications). Inference is the "using" phase of machine learning, as opposed to training. For LLMs, inference means processing a prompt through the model to generate a response autoregressively, one token at a time, with each new token conditioned on the prompt and all previously generated tokens.
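As a minimal sketch of that token-by-token loop, assuming the Hugging Face transformers and PyTorch packages and the small GPT-2 checkpoint (any causal language model would work the same way), greedy decoding is just repeated forward passes:

```python
# Minimal greedy decoding loop: one forward pass per generated token.
# Assumes `torch` and `transformers` are installed and can download "gpt2".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: no dropout, no weight updates

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():                               # never compute gradients
    for _ in range(10):                             # generate 10 tokens
        logits = model(input_ids).logits            # shape: [1, seq_len, vocab]
        next_id = logits[0, -1].argmax()            # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

In practice, model.generate() or a serving engine wraps this loop with batching, sampling, and KV caching, but the core operation is the same forward pass repeated per token.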
What this is NOT
- Not training (inference doesn't update weights)
- Not fine-tuning (that involves training)
- Not the model architecture itself (inference is the execution of a model, not its design)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The process of sending a prompt to an LLM and receiving a completion. Inference happens at API endpoints or when running models locally. Characterized by latency (time to first token, time to completion) and throughput (tokens per second).
Sources: Model serving documentation, LLM inference optimization papers
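A hedged sketch of measuring those two characteristics against an OpenAI-compatible chat endpoint, assuming the `openai` Python package; the model name is illustrative, and counting stream chunks is only a rough proxy for tokens:

```python
# Rough latency/throughput measurement over a streaming chat completion.
# Assumes OPENAI_API_KEY is set; "gpt-4o-mini" is an illustrative model name.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain inference in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()    # time to first token (TTFT)
        n_chunks += 1                               # roughly one token per chunk

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s  total: {elapsed:.2f}s  "
      f"throughput: ~{n_chunks / elapsed:.1f} tok/s")
```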
ml-engineering
Executing a forward pass through a neural network to produce output from input, using trained weights without updating them.
Sources: ML systems literature
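A minimal sketch of this view, assuming PyTorch; the tiny, randomly initialized network and input stand in for a trained model and real data:

```python
# One forward pass through a network with gradient tracking disabled:
# the weights are read, never updated.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
model.eval()                       # e.g. disables dropout at inference time

x = torch.randn(1, 16)             # a single input example
with torch.no_grad():              # forward pass only; no gradients, no training
    logits = model(x)
    prediction = logits.argmax(dim=-1)

print(prediction.item())           # predicted class index
```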
Examples
- Sending a chat message to GPT-4 API and receiving a response
- Running Llama locally with llama.cpp
- Batch inference that processes 1,000 prompts in a single job (see the sketch after this list)
- Streaming inference, where tokens are returned to the client as they are generated
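For the batch case, a hedged sketch using vLLM (listed under Implementations below); it assumes a GPU machine with the `vllm` package installed, and the checkpoint name is illustrative:

```python
# Offline batch inference: many prompts submitted in one call so the engine
# can batch them on the GPU. Assumes `vllm` is installed and a GPU is available.
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i} in one sentence." for i in range(1000)]
params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # illustrative checkpoint
outputs = llm.generate(prompts, params)                # runs the whole batch

for out in outputs[:3]:                                # inspect a few results
    print(out.outputs[0].text)
```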
Counterexamples
Things that might seem like Inference but are not:
- Training a model on data (that's training)
- Fine-tuning on custom data (that's training)
- Model architecture design (that's before inference)
Relations
- requires large-language-model (Inference executes a trained LLM)
- requires token (Inference consumes prompt tokens and produces output tokens)
- inTensionWith fine-tuning (Inference uses trained models; fine-tuning trains them)
- overlapsWith model-serving (Model serving handles inference requests)
Implementations
Tools and frameworks that implement this concept (a minimal local usage example follows the list):
- llama.cpp primary
- LM Studio primary
- Ollama primary
- Text Generation Inference primary
- vLLM primary
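As a concrete local example with one of the tools above, a hedged sketch against Ollama's HTTP API; it assumes the Ollama server is running on its default port with a model already pulled (the model name is illustrative), and uses only the Python standard library:

```python
# Send one prompt to a locally running Ollama server and print the completion.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
import json
import urllib.request

payload = {"model": "llama3", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["response"])   # the generated text
```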