Model Serving


Also known as: Model Deployment, Inference Serving, Model Hosting

Definition

The infrastructure and systems that make trained models available for inference requests. Model serving handles loading models into memory, processing requests, managing resources, and returning predictions. It's the bridge between a trained model file and a production API that applications can call.
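
For a concrete picture, here is a minimal sketch of what a serving layer does at its core: load the model into memory once at startup, accept HTTP requests, run inference, and return predictions. It assumes a FastAPI app and a joblib-saved scikit-learn-style model; the file name and input schema are placeholders, not a prescribed setup.

```python
# Minimal serving sketch (assumptions: FastAPI, joblib, scikit-learn-style model).
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "model.joblib"  # hypothetical path to a trained model artifact
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model into memory once at startup, not per request.
    global model
    model = joblib.load(MODEL_PATH)
    yield
    # Teardown would release GPU memory, close connection pools, etc.


app = FastAPI(lifespan=lifespan)


class PredictRequest(BaseModel):
    features: list[float]  # hypothetical input schema


@app.post("/predict")
def predict(req: PredictRequest):
    # The bridge from trained artifact to production API:
    # validate input, run inference, return the prediction.
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```

In practice this app would run behind an ASGI server such as uvicorn and be wrapped with batching, autoscaling, and monitoring, which is where dedicated serving frameworks come in.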

What this is NOT

  • Not model training (serving is for inference)
  • Not the model itself (serving is the infrastructure)
  • Not just an API (serving includes resource management)

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

Running LLMs as accessible services, whether through managed APIs (OpenAI, Anthropic), self-hosted solutions (vLLM, TGI), or cloud platforms (AWS Bedrock, Azure OpenAI).

Sources: vLLM documentation, Text Generation Inference (TGI) documentation, Cloud provider ML serving documentation
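
To illustrate the self-hosted variant, the sketch below assumes a vLLM (or TGI) server is already running locally and exposing an OpenAI-compatible endpoint; the base URL, port, and model id are assumptions for the example, not fixed values.

```python
# Sketch of the "self-hosted LLM as a service" pattern: a vLLM server started
# separately (e.g. `vllm serve <model-id>`) exposes an OpenAI-compatible HTTP
# API that clients call the same way they would call a managed service.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local vLLM endpoint
    api_key="not-needed-for-local",       # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize what model serving is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```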

MLOps

The operational practice of deploying trained models to production, including model loading, request handling, scaling, monitoring, and lifecycle management.

Sources: MLOps literature, TensorFlow Serving and TorchServe documentation
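
As one way to make this lifecycle concrete, the sketch below uses TorchServe's custom-handler hooks, where model loading, request preprocessing, inference, and response shaping are explicit stages; the artifact name, payload format, and device handling are illustrative assumptions rather than a canonical handler.

```python
# Hedged sketch of the MLOps view via TorchServe custom-handler hooks:
# load on worker start, then preprocess -> inference -> postprocess per request.
import torch
from ts.torch_handler.base_handler import BaseHandler


class ExampleHandler(BaseHandler):
    def initialize(self, context):
        # Called once per worker: locate the artifact and load it into memory.
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Assumes a TorchScript artifact named model.pt (illustrative).
        self.model = torch.jit.load(f"{model_dir}/model.pt", map_location=self.device)
        self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # TorchServe delivers a batch of requests; assume each body is a
        # JSON array of floats of equal length (illustrative payload format).
        inputs = [
            torch.tensor(req.get("body") or req.get("data"), dtype=torch.float32)
            for req in data
        ]
        return torch.stack(inputs).to(self.device)

    def inference(self, batch):
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, outputs):
        # Return one response per request in the batch.
        return outputs.cpu().tolist()
```

Around these hooks, the serving framework takes care of the surrounding operational concerns: worker scaling, request batching, metrics, and model versioning.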

Examples

  • vLLM serving Llama-70B with PagedAttention
  • TGI (Text Generation Inference) from Hugging Face
  • Ollama for local model serving (see the sketch after this list)
  • AWS SageMaker endpoint hosting a fine-tuned model
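
As a small illustration of the local-serving case from the list above, the snippet below queries an Ollama server through its REST API; it assumes the server is running on the default port and that the referenced model tag has already been pulled.

```python
# Illustrative only: querying a locally running Ollama server over HTTP.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # default Ollama port
    json={
        "model": "llama3",  # placeholder model tag, assumed already pulled
        "prompt": "What does a model server do?",
        "stream": False,    # return a single JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```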

Counterexamples

Things that might seem like Model Serving but are not:

  • Training a model (that's training, not serving)
  • Calling the OpenAI API (you're consuming their serving infrastructure, not running your own)
  • A model file sitting on disk (not being served)

Relations

Implementations

Tools and frameworks that implement this concept:

  • vLLM
  • Text Generation Inference (TGI)
  • TensorFlow Serving
  • TorchServe
  • Ollama
  • Managed platforms such as AWS SageMaker endpoints, AWS Bedrock, and Azure OpenAI