Inference Endpoint

Interface · Deployment · Published

Also known as: API Endpoint, Model Endpoint, Prediction Endpoint

Definition

An API endpoint that accepts inference requests and returns model predictions. Inference endpoints abstract away the complexity of model serving—clients send requests (prompts) and receive responses (completions) without managing the underlying infrastructure. They define the contract for how to interact with a deployed model.
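
As a minimal sketch of that contract, assuming a hypothetical HTTP endpoint (the URL, header, and payload keys below are illustrative placeholders, not any specific provider's API):

```python
import requests

# Hypothetical endpoint URL and schema -- real names depend on the provider.
ENDPOINT_URL = "https://example.com/v1/generate"

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": "Bearer <API_KEY>"},  # auth scheme varies by provider
    json={
        "prompt": "Explain inference endpoints in one sentence.",
        "max_tokens": 64,  # request = prompt plus decoding parameters
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # response = the model's prediction, typically JSON
```

The client never sees GPUs, batching, or model weights; it only sees this request/response exchange.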

What this is NOT

  • Not the model itself (endpoint is the interface)
  • Not the serving infrastructure (endpoint is what clients see)
  • Not a web application (endpoint serves model predictions)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

The URL and API specification for sending prompts to an LLM and receiving completions. Examples: OpenAI's /v1/chat/completions, Anthropic's /v1/messages, and self-hosted endpoints served by vLLM or Hugging Face Text Generation Inference (TGI).

Sources: OpenAI API documentation, Anthropic API documentation, Hugging Face Inference Endpoints
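
A sketch of the two hosted request shapes named above, using plain HTTP so the wire format is visible; the field names follow the providers' public documentation, while the API keys and model names are placeholders:

```python
import requests

# OpenAI-style chat completion (POST /v1/chat/completions)
openai_resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENAI_API_KEY>"},
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(openai_resp.json()["choices"][0]["message"]["content"])

# Anthropic-style message (POST /v1/messages)
anthropic_resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": "<ANTHROPIC_API_KEY>",
        "anthropic-version": "2023-06-01",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",  # placeholder model name
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(anthropic_resp.json()["content"][0]["text"])
```

Both endpoints implement the same basic contract (prompt in, completion out) behind different URL paths, authentication headers, and response shapes.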

Examples

  • POST https://api.openai.com/v1/chat/completions
  • POST https://api.anthropic.com/v1/messages
  • Hugging Face Inference Endpoint for a custom model
  • Self-hosted vLLM endpoint at localhost:8000/v1/completions
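
A sketch of calling a self-hosted endpoint like the last example; it assumes a vLLM OpenAI-compatible server is already running on localhost:8000 and reuses the standard openai Python client by pointing its base_url at the local server:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM typically accepts any key unless one is configured
)

completion = client.completions.create(
    model="my-model",  # placeholder: must match the model the server was started with
    prompt="Inference endpoints are",
    max_tokens=32,
)
print(completion.choices[0].text)
```

Because the self-hosted endpoint follows the same API specification as the hosted one, the same client code works against either, which is exactly the infrastructure abstraction described in the definition above.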

Counterexamples

Things that might seem like an Inference Endpoint but are not:

  • A model file on disk (not exposed as endpoint)
  • The model serving infrastructure (endpoint is the interface)
  • A general REST API (not specifically for inference)

Relations

  • overlapsWith model-serving (Endpoints are exposed by serving systems)
  • overlapsWith api-gateway (Gateways often front inference endpoints)
  • overlapsWith rate-limiting (Rate limits are typically enforced at the endpoint level)

Implementations

Tools and frameworks that implement this concept:

  • vLLM (self-hosted, OpenAI-compatible endpoints)
  • Hugging Face Text Generation Inference (TGI)
  • Hugging Face Inference Endpoints (managed hosting)
  • Hosted provider APIs such as OpenAI and Anthropic