Model Serving


Also known as: Model Deployment, Inference Serving, Model Hosting

Definition

The infrastructure and systems that make trained models available for inference requests. Model serving handles loading models into memory, processing requests, managing resources, and returning predictions. It's the bridge between a trained model file and a production API that applications can call.
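
For a concrete picture, here is a minimal sketch of what a serving layer does at its core: load the model into memory once at startup, accept HTTP requests, run inference, and return predictions. It assumes a FastAPI app and a joblib-saved scikit-learn-style model; the file name and input schema are placeholders, not a prescribed setup.

```python
# Minimal serving sketch (assumptions: FastAPI, joblib, scikit-learn-style model).
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "model.joblib"  # hypothetical path to a trained model artifact
model = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model into memory once at startup, not per request.
    global model
    model = joblib.load(MODEL_PATH)
    yield
    # Teardown would release GPU memory, close connection pools, etc.


app = FastAPI(lifespan=lifespan)


class PredictRequest(BaseModel):
    features: list[float]  # hypothetical input schema


@app.post("/predict")
def predict(req: PredictRequest):
    # The bridge from trained artifact to production API:
    # validate input, run inference, return the prediction.
    prediction = model.predict([req.features])
    return {"prediction": prediction.tolist()}
```

In practice this app would run behind an ASGI server such as uvicorn and be wrapped with batching, autoscaling, and monitoring, which is where dedicated serving frameworks come in.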

What this is NOT

  • Not model training (serving is for inference)
  • Not the model itself (serving is the infrastructure)
  • Not just an API (serving includes resource management)

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

Running LLMs as accessible services, whether through managed APIs (OpenAI, Anthropic), self-hosted solutions (vLLM, TGI), or cloud platforms (AWS Bedrock, Azure OpenAI).

Sources: vLLM documentation, Text Generation Inference (TGI) documentation, Cloud provider ML serving documentation
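
To illustrate the self-hosted variant, the sketch below assumes a vLLM (or TGI) server is already running locally and exposing an OpenAI-compatible endpoint; the base URL, port, and model id are assumptions for the example, not fixed values.

```python
# Sketch of the "self-hosted LLM as a service" pattern: a vLLM server started
# separately (e.g. `vllm serve <model-id>`) exposes an OpenAI-compatible HTTP
# API that clients call the same way they would call a managed service.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local vLLM endpoint
    api_key="not-needed-for-local",       # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize what model serving is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```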

MLOps

The operational practice of deploying trained models to production, including model loading, request handling, scaling, monitoring, and lifecycle management.

Sources: MLOps literature, TensorFlow Serving and TorchServe documentation
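
As one way to make this lifecycle concrete, the sketch below uses TorchServe's custom-handler hooks, where model loading, request preprocessing, inference, and response shaping are explicit stages; the artifact name, payload format, and device handling are illustrative assumptions rather than a canonical handler.

```python
# Hedged sketch of the MLOps view via TorchServe custom-handler hooks:
# load on worker start, then preprocess -> inference -> postprocess per request.
import torch
from ts.torch_handler.base_handler import BaseHandler


class ExampleHandler(BaseHandler):
    def initialize(self, context):
        # Called once per worker: locate the artifact and load it into memory.
        model_dir = context.system_properties.get("model_dir")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Assumes a TorchScript artifact named model.pt (illustrative).
        self.model = torch.jit.load(f"{model_dir}/model.pt", map_location=self.device)
        self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # TorchServe delivers a batch of requests; assume each body is a
        # JSON array of floats of equal length (illustrative payload format).
        inputs = [
            torch.tensor(req.get("body") or req.get("data"), dtype=torch.float32)
            for req in data
        ]
        return torch.stack(inputs).to(self.device)

    def inference(self, batch):
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, outputs):
        # Return one response per request in the batch.
        return outputs.cpu().tolist()
```

Around these hooks, the serving framework takes care of the surrounding operational concerns: worker scaling, request batching, metrics, and model versioning.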

Examples

  • vLLM serving Llama-70B with PagedAttention
  • TGI (Text Generation Inference) from Hugging Face
  • Ollama for local model serving (see the sketch after this list)
  • AWS SageMaker endpoint hosting a fine-tuned model
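
As a small illustration of the local-serving case from the list above, the snippet below queries an Ollama server through its REST API; it assumes the server is running on the default port and that the referenced model tag has already been pulled.

```python
# Illustrative only: querying a locally running Ollama server over HTTP.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # default Ollama port
    json={
        "model": "llama3",  # placeholder model tag, assumed already pulled
        "prompt": "What does a model server do?",
        "stream": False,    # return a single JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```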

Counterexamples

Things that might seem like Model Serving but are not:

  • Training a model (that's training, not serving)
  • Calling the OpenAI API (you're consuming their serving infrastructure, not running your own)
  • A model file sitting on disk (not being served)

Relations

Implementations

Tools and frameworks that implement this concept:

  • vLLM
  • Text Generation Inference (TGI)
  • TensorFlow Serving
  • TorchServe
  • Ollama
  • Managed platforms such as AWS SageMaker endpoints, AWS Bedrock, and Azure OpenAI