Batch Inference

Process · Deployment · Published

Also known as: Batch Processing, Offline Inference, Bulk Inference

Definition

Processing many inference requests together as a batch rather than one at a time. Batch inference optimizes for throughput and cost rather than latency; it is appropriate when you have many prompts to process and don't need immediate results. It is often significantly cheaper than real-time inference.

What this is NOT

  • Not real-time inference (batch has higher latency)
  • Not streaming (batch returns complete results, not incremental)
  • Not continuous batching (that's a serving optimization, not a user-facing mode)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Submitting multiple prompts to be processed together, often through batch APIs (e.g., the OpenAI Batch API) or by running inference jobs over datasets. Results are returned asynchronously, sometimes hours later; a minimal workflow is sketched below.

Sources: OpenAI Batch API documentation, Hugging Face batch inference patterns, vLLM batch processing
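
As a concrete illustration of this interpretation, here is a minimal sketch of the asynchronous workflow, assuming the OpenAI Python SDK (v1.x): write one request per JSONL line, upload the file, create a batch job, then poll and download results later. The model name, file path, and prompts are placeholders chosen for illustration, not part of the definition above.

    import json
    from openai import OpenAI

    client = OpenAI()
    prompts = ["Summarize document A.", "Summarize document B."]

    # 1. Write one JSONL line per request, each with a unique custom_id.
    with open("batch_input.jsonl", "w") as f:
        for i, prompt in enumerate(prompts):
            f.write(json.dumps({
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",  # placeholder model name
                    "messages": [{"role": "user", "content": prompt}],
                },
            }) + "\n")

    # 2. Upload the file and create the batch job (completion window is 24 hours).
    batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )

    # 3. Poll later; when the job is completed, download results from output_file_id.
    batch = client.batches.retrieve(batch.id)
    if batch.status == "completed":
        results = client.files.content(batch.output_file_id).text  # JSONL, one line per request
        print(results)

The key difference from real-time calls is step 3: results arrive whenever the provider finishes the job, which is what enables the throughput and cost advantages described in the definition.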

Examples

  • OpenAI Batch API processing 10,000 prompts at a 50% discount
  • Running evaluation suite over a benchmark dataset
  • Bulk content moderation for uploaded documents
  • Generating embeddings for an entire document corpus (see the sketch after this list)
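
The corpus-embedding example is sketched below, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model purely for illustration; any embedding model driven in fixed-size batches follows the same pattern.

    from sentence_transformers import SentenceTransformer

    # Placeholder corpus; in practice this would be loaded from disk or a database.
    corpus = [f"Document {i} text ..." for i in range(10_000)]

    # encode() batches the inputs internally; batch_size trades memory for throughput.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(corpus, batch_size=64, show_progress_bar=True)

    print(embeddings.shape)  # (10000, embedding_dim)

No single request needs a fast answer here; the job is measured by total documents per hour, which is the batch-inference tradeoff in miniature.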

Counterexamples

Things that might seem like Batch Inference but are not:

  • Interactive chat with streaming response
  • Real-time API calls with immediate response
  • Single synchronous inference request

Relations

  • inTensionWith streaming (Batch optimizes throughput; streaming optimizes latency)
  • overlapsWith inference (Batch inference is still inference)
  • overlapsWith model-serving (Serving systems support batch mode)