Model Serving
Also known as: Model Deployment, Inference Serving, Model Hosting
Definition
The infrastructure and systems that make trained models available for inference requests. Model serving handles loading models into memory, processing requests, managing resources, and returning predictions. It's the bridge between a trained model file and a production API that applications can call.
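To make those responsibilities concrete, here is a minimal sketch of a serving process: load a trained model into memory once at startup, then handle prediction requests over HTTP. The model file, input schema, and choice of FastAPI are assumptions for illustration, not part of any particular serving stack.

```python
# Minimal model-serving sketch: load the model once, then serve many requests.
# Assumes a scikit-learn-style model saved as "model.pkl" (hypothetical path/schema).
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the trained model into memory at startup -- not once per request.
    state["model"] = joblib.load("model.pkl")
    yield
    state.clear()  # release resources on shutdown

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    features: list[float]  # hypothetical input schema

class PredictResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Request handling: convert the payload to model input, return the prediction.
    y = state["model"].predict([req.features])[0]
    return PredictResponse(prediction=float(y))

# Run with: uvicorn serve:app --port 8000  (assuming this file is saved as serve.py)
```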
What this is NOT
- Not model training (serving is for inference)
- Not the model itself (serving is the infrastructure)
- Not just an API (serving includes resource management)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Running LLMs as accessible services, whether through managed APIs (OpenAI, Anthropic), self-hosted solutions (vLLM, TGI), or cloud platforms (Amazon Bedrock, Azure OpenAI).
Sources: vLLM documentation, Text Generation Inference (TGI) documentation, Cloud provider ML serving documentation
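In this sense, the same client code can often target either a managed API or a self-hosted server, because tools like vLLM and TGI expose OpenAI-compatible endpoints. A sketch, assuming a vLLM server is already running locally on port 8000 with a Llama model loaded (both assumptions):

```python
# Query a self-hosted, OpenAI-compatible serving endpoint (e.g. one started with
# `vllm serve meta-llama/Llama-3.1-8B-Instruct`). The port, model name, and the
# presence of a running server are assumptions for this sketch.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server instead of a managed API
    api_key="not-needed-for-local",       # placeholder; local servers may ignore it
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "What is model serving?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```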
ml-ops
The operational practice of deploying trained models to production, including model loading, request handling, scaling, monitoring, and lifecycle management.
Sources: MLOps literature, TensorFlow Serving, TorchServe documentation
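As a concrete instance of this interpretation, a trained model deployed behind TensorFlow Serving is typically queried over its REST API. The sketch below assumes a server is already running on the default REST port (8501) with a model registered as "my_model"; both are assumptions for illustration.

```python
# Query a model deployed behind TensorFlow Serving's REST API.
# Assumes a running server on port 8501 and a model named "my_model" (hypothetical).
import requests

payload = {"instances": [[1.0, 2.0, 5.0, 0.3]]}  # shape must match the model's signature

resp = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["predictions"])
```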
Examples
- vLLM serving Llama-70B with PagedAttention
- TGI (Text Generation Inference) from Hugging Face
- Ollama for local model serving (queried in the sketch after this list)
- AWS SageMaker endpoint hosting a fine-tuned model
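To show what interacting with one of these looks like, the sketch below queries a model served locally by Ollama on its default port. The model name and the presence of a running Ollama daemon are assumptions.

```python
# Query a locally served model via Ollama's HTTP API (default port 11434).
# Assumes the Ollama daemon is running and the model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # hypothetical local model name
        "prompt": "Explain model serving in one sentence.",
        "stream": False,    # return a single JSON object instead of a token stream
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])
```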
Counterexamples
Things that might seem like Model Serving but are not:
- Training a model (that's training, not serving)
- Calling the OpenAI API (you're using their serving)
- A model file sitting on disk (not being served)
Relations
- overlapsWith inference (Serving handles inference requests)
- overlapsWith inference-endpoint (Endpoints are the interface to serving)
- overlapsWith load-balancing (Serving often includes load balancing)
Implementations
Tools and frameworks that implement this concept:
- Amazon Bedrock (primary)
- Azure OpenAI Service (primary)
- Google Cloud Platform (primary)
- Google Cloud Vertex AI (primary)
- Modal (secondary)
- Ollama (primary)
- Text Generation Inference (primary)
- vLLM (primary)