Inference Endpoint
Also known as: API Endpoint, Model Endpoint, Prediction Endpoint
Definition
An API endpoint that accepts inference requests and returns model predictions. Inference endpoints abstract away the complexity of model serving—clients send requests (prompts) and receive responses (completions) without managing the underlying infrastructure. They define the contract for how to interact with a deployed model.
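A minimal sketch of that request/response contract from the client side, assuming an OpenAI-compatible chat completions endpoint; the base URL, environment variable, and model name are illustrative placeholders, not prescriptions.

```python
# Client-side view of an inference endpoint: send a prompt, receive a completion.
# Assumes an OpenAI-compatible /v1/chat/completions contract and an API key in the
# environment; the model name is a placeholder.
import os
import requests

ENDPOINT = "https://api.openai.com/v1/chat/completions"

response = requests.post(
    ENDPOINT,
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Explain inference endpoints in one sentence."}
        ],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

The client never touches GPUs, model weights, or batching logic; it only needs the URL, the request schema, and credentials.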
What this is NOT
- Not the model itself (endpoint is the interface)
- Not the serving infrastructure (endpoint is what clients see)
- Not a web application (endpoint returns model predictions, not pages or UI)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The URL and API specification for sending prompts to an LLM and receiving completions. Examples: OpenAI's /v1/chat/completions, Anthropic's /v1/messages, self-hosted endpoints from vLLM or TGI (Text Generation Inference).
Sources: OpenAI API documentation, Anthropic API documentation, Hugging Face Inference Endpoints
Examples
- POST https://api.openai.com/v1/chat/completions
- POST https://api.anthropic.com/v1/messages
- Hugging Face Inference Endpoint for a custom model
- Self-hosted vLLM endpoint at localhost:8000/v1/completions
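A sketch of calling the self-hosted vLLM endpoint from the list above. Because vLLM exposes an OpenAI-compatible API, only the base URL changes relative to a hosted provider; the model name here is an assumption and depends on what the local server was launched with.

```python
# Same request/response contract, different endpoint: a local vLLM server.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed: the model the server loaded
        "prompt": "Define 'inference endpoint' in one sentence:",
        "max_tokens": 64,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```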
Counterexamples
Things that might seem like an Inference Endpoint but are not:
- A model file on disk (not exposed as endpoint)
- The model serving infrastructure (endpoint is the interface)
- A general REST API (not specifically for inference)
Relations
- overlapsWith model-serving (Endpoints are exposed by serving systems)
- overlapsWith api-gateway (Gateways often front inference endpoints)
- overlapsWith rate-limiting (Hosted endpoints typically enforce rate limits; see the sketch below)
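Because rate-limited endpoints commonly reject excess requests with HTTP 429, clients usually retry with backoff. A minimal sketch of that pattern, assuming the endpoint may set a standard Retry-After header; the function and parameter names are hypothetical.

```python
# Retry an inference request with exponential backoff when the endpoint rate-limits it.
import time
import requests

def post_with_backoff(url: str, payload: dict, headers: dict | None = None,
                      max_retries: int = 5) -> dict:
    """POST to an inference endpoint, backing off on HTTP 429 responses."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Honor Retry-After if provided; otherwise back off exponentially.
        delay = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"Rate-limited after {max_retries} attempts")
```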
Implementations
Tools and frameworks that implement this concept:
- Amazon Bedrock (primary)
- Azure OpenAI Service (primary)
- Cloudflare Workers (secondary)
- Codeium (secondary)
- GitHub Copilot (secondary)
- Google Cloud Platform (primary)
- Google Cloud Vertex AI (primary)
- Google Stitch (secondary)
- Hugging Face (primary)
- Modal (primary)
- Vercel (secondary)