Batch Inference
Also known as: Batch Processing, Offline Inference, Bulk Inference
Definition
Processing many inference requests together as a batch rather than one at a time. Batch inference optimizes for throughput and cost rather than latency; it's appropriate when you have many prompts to process and don't need immediate results. It is often significantly cheaper than real-time inference.
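To make the throughput-versus-latency tradeoff concrete, here is a minimal sketch that runs many prompts through a local Hugging Face transformers pipeline in batches instead of one call per prompt. The model name, prompt set, and batch size are illustrative assumptions, not recommendations.

```python
# Minimal sketch: batched local inference with a transformers pipeline.
# Model, prompts, and batch_size are illustrative placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# gpt2 has no pad token; reuse EOS so prompts can be padded into batches.
generator.tokenizer.pad_token = generator.tokenizer.eos_token

prompts = [f"Summarize document {i}:" for i in range(64)]

# One at a time (latency-oriented): one forward pass per prompt.
# single_results = [generator(p, max_new_tokens=50) for p in prompts]

# Batched (throughput-oriented): prompts are grouped 16 at a time, so the
# accelerator processes many sequences per forward pass.
batch_results = generator(prompts, batch_size=16, max_new_tokens=50)
```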
What this is NOT
- Not real-time inference (batch has higher latency)
- Not streaming (batch returns complete results, not incremental)
- Not continuous batching (that's a serving optimization, not a user-facing mode)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Submitting multiple prompts to be processed together, often through batch APIs (OpenAI Batch API) or by running inference jobs over datasets. Results are returned asynchronously, sometimes hours later.
Sources: OpenAI Batch API documentation, Hugging Face batch inference patterns, vLLM batch processing
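A hedged sketch of this workflow using the OpenAI Batch API flow (a JSONL request file, an upload, then an asynchronous batch job). The model name, file path, and prompts are placeholders; exact parameters are in the OpenAI Batch API documentation cited above.

```python
# Sketch of an asynchronous batch job via the OpenAI Batch API.
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one JSONL line per request, each with a unique custom_id.
prompts = ["Classify this review: great product!",
           "Classify this review: arrived broken."]
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        f.write(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",  # placeholder model
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and create the batch job; results arrive asynchronously
#    within the completion window, not in this request.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(input_file_id=input_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")

# 3. Later (possibly hours later): poll the job and download the results.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text
    print(results)  # JSONL: one response object per custom_id
```

Because results come back by custom_id rather than in submission order, callers typically join them back to their source records after download.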
Examples
- OpenAI Batch API processing 10,000 prompts at a 50% discount
- Running evaluation suite over a benchmark dataset
- Bulk content moderation for uploaded documents
- Generating embeddings for an entire document corpus (see the sketch after this list)
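As a sketch of the last example, the snippet below embeds a corpus in batches with sentence-transformers; the model name, corpus, and batch size are illustrative assumptions.

```python
# Sketch: bulk embedding generation over a corpus, batched for throughput.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
corpus = [f"Document {i} text ..." for i in range(10_000)]  # stand-in corpus

# encode() groups the corpus into batches internally, trading per-document
# latency for much higher overall throughput.
embeddings = model.encode(corpus, batch_size=256, show_progress_bar=True)
print(embeddings.shape)  # (10000, 384) for this model
```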
Counterexamples
Things that might seem like Batch Inference but are not:
- Interactive chat with streaming response
- Real-time API calls with immediate response
- Single synchronous inference request
Relations
- inTensionWith streaming (Batch optimizes throughput; streaming optimizes latency)
- overlapsWith inference (Batch inference is still inference)
- overlapsWith model-serving (Serving systems support batch mode)