Inference Endpoint
Also known as: API Endpoint, Model Endpoint, Prediction Endpoint
Definition
An API endpoint that accepts inference requests and returns model predictions. Inference endpoints abstract away the complexity of model serving—clients send requests (prompts) and receive responses (completions) without managing the underlying infrastructure. They define the contract for how to interact with a deployed model.
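A minimal sketch of that request/response contract from the client side, assuming an OpenAI-compatible chat completions endpoint; the base URL, environment variable, and model name are illustrative placeholders, not prescriptions.

```python
# Client-side view of an inference endpoint: send a prompt, receive a completion.
# Assumes an OpenAI-compatible /v1/chat/completions contract and an API key in the
# environment; the model name is a placeholder.
import os
import requests

ENDPOINT = "https://api.openai.com/v1/chat/completions"

response = requests.post(
    ENDPOINT,
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Explain inference endpoints in one sentence."}
        ],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

The client never touches GPUs, model weights, or batching logic; it only needs the URL, the request schema, and credentials.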
What this is NOT
- Not the model itself (endpoint is the interface)
- Not the serving infrastructure (endpoint is what clients see)
- Not a web application (endpoint returns model predictions, not pages or UI)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The URL and API specification for sending prompts to an LLM and receiving completions. Examples: OpenAI's /v1/chat/completions, Anthropic's /v1/messages, self-hosted endpoints from vLLM or TGI (Text Generation Inference).
Sources: OpenAI API documentation, Anthropic API documentation, Hugging Face Inference Endpoints
Examples
- POST https://api.openai.com/v1/chat/completions
- POST https://api.anthropic.com/v1/messages
- Hugging Face Inference Endpoint for a custom model
- Self-hosted vLLM endpoint at localhost:8000/v1/completions
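A sketch of calling the self-hosted vLLM endpoint from the list above. Because vLLM exposes an OpenAI-compatible API, only the base URL changes relative to a hosted provider; the model name here is an assumption and depends on what the local server was launched with.

```python
# Same request/response contract, different endpoint: a local vLLM server.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed: the model the server loaded
        "prompt": "Define 'inference endpoint' in one sentence:",
        "max_tokens": 64,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```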
Counterexamples
Things that might seem like an Inference Endpoint but are not:
- A model file on disk (not exposed as endpoint)
- The model serving infrastructure (endpoint is the interface)
- A general REST API (not specifically for inference)
Relations
- overlapsWith model-serving (Endpoints are exposed by serving systems)
- overlapsWith api-gateway (Gateways often front inference endpoints)
- overlapsWith rate-limiting (Hosted endpoints typically enforce rate limits; see the sketch below)
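Because rate-limited endpoints commonly reject excess requests with HTTP 429, clients usually retry with backoff. A minimal sketch of that pattern, assuming the endpoint may set a standard Retry-After header; the function and parameter names are hypothetical.

```python
# Retry an inference request with exponential backoff when the endpoint rate-limits it.
import time
import requests

def post_with_backoff(url: str, payload: dict, headers: dict | None = None,
                      max_retries: int = 5) -> dict:
    """POST to an inference endpoint, backing off on HTTP 429 responses."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Honor Retry-After if provided; otherwise back off exponentially.
        delay = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"Rate-limited after {max_retries} attempts")
```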
Implementations
Tools and frameworks that implement this concept:
- Amazon Bedrock (primary)
- Azure OpenAI Service (primary)
- Cloudflare Workers (secondary)
- Codeium (secondary)
- GitHub Copilot (secondary)
- Google Cloud Platform (primary)
- Google Cloud Vertex AI (primary)
- Google Stitch (secondary)
- Hugging Face (primary)
- Modal (primary)
- Vercel (secondary)