Batch Inference
Also known as: Batch Processing, Offline Inference, Bulk Inference
Definition
Processing many inference requests together as a batch rather than one at a time. Batch inference optimizes for throughput and cost rather than latency; it's appropriate when you have many prompts to process and don't need immediate results. It is often significantly cheaper than real-time inference.
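To make the throughput-versus-latency tradeoff concrete, here is a minimal sketch that runs many prompts through a local Hugging Face transformers pipeline in batches instead of one call per prompt. The model name, prompt set, and batch size are illustrative assumptions, not recommendations.

```python
# Minimal sketch: batched local inference with a transformers pipeline.
# Model, prompts, and batch_size are illustrative placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# gpt2 has no pad token; reuse EOS so prompts can be padded into batches.
generator.tokenizer.pad_token = generator.tokenizer.eos_token

prompts = [f"Summarize document {i}:" for i in range(64)]

# One at a time (latency-oriented): one forward pass per prompt.
# single_results = [generator(p, max_new_tokens=50) for p in prompts]

# Batched (throughput-oriented): prompts are grouped 16 at a time, so the
# accelerator processes many sequences per forward pass.
batch_results = generator(prompts, batch_size=16, max_new_tokens=50)
```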
What this is NOT
- Not real-time inference (batch has higher latency)
- Not streaming (batch returns complete results, not incremental)
- Not continuous batching (that's a serving optimization, not a user-facing mode)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Submitting multiple prompts to be processed together, often through batch APIs (OpenAI Batch API) or by running inference jobs over datasets. Results are returned asynchronously, sometimes hours later.
Sources: OpenAI Batch API documentation, Hugging Face batch inference patterns, vLLM batch processing
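A hedged sketch of this workflow using the OpenAI Batch API flow (a JSONL request file, an upload, then an asynchronous batch job). The model name, file path, and prompts are placeholders; exact parameters are in the OpenAI Batch API documentation cited above.

```python
# Sketch of an asynchronous batch job via the OpenAI Batch API.
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one JSONL line per request, each with a unique custom_id.
prompts = ["Classify this review: great product!",
           "Classify this review: arrived broken."]
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        f.write(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",  # placeholder model
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and create the batch job; results arrive asynchronously
#    within the completion window, not in this request.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"),
                                 purpose="batch")
batch = client.batches.create(input_file_id=input_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")

# 3. Later (possibly hours later): poll the job and download the results.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id).text
    print(results)  # JSONL: one response object per custom_id
```

Because results come back by custom_id rather than in submission order, callers typically join them back to their source records after download.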
Examples
- OpenAI Batch API processing 10,000 prompts at a 50% discount
- Running evaluation suite over a benchmark dataset
- Bulk content moderation for uploaded documents
- Generating embeddings for an entire document corpus (see the sketch after this list)
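As a sketch of the last example, the snippet below embeds a corpus in batches with sentence-transformers; the model name, corpus, and batch size are illustrative assumptions.

```python
# Sketch: bulk embedding generation over a corpus, batched for throughput.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
corpus = [f"Document {i} text ..." for i in range(10_000)]  # stand-in corpus

# encode() groups the corpus into batches internally, trading per-document
# latency for much higher overall throughput.
embeddings = model.encode(corpus, batch_size=256, show_progress_bar=True)
print(embeddings.shape)  # (10000, 384) for this model
```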
Counterexamples
Things that might seem like Batch Inference but are not:
- Interactive chat with streaming response
- Real-time API calls with immediate response
- Single synchronous inference request
Relations
- inTensionWith streaming (Batch optimizes throughput; streaming optimizes latency)
- overlapsWith inference (Batch inference is still inference)
- overlapsWith model-serving (Serving systems support batch mode)