Streaming

Also known as: Token Streaming, Server-Sent Events, SSE

Definition

Delivering LLM output incrementally as tokens are generated, rather than waiting for the complete response. Streaming improves perceived latency: users see text appear progressively instead of staring at a blank screen during a potentially long generation. It is typically implemented over Server-Sent Events (SSE) or WebSockets.
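
To make the perceived-latency difference concrete, here is a minimal sketch of consuming a stream with the OpenAI Python SDK. The model name is illustrative, and an OPENAI_API_KEY environment variable is assumed:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # With stream=True the call returns an iterator of chunks instead of
    # blocking until the full completion is ready.
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # text appears progressively

Note that total generation time is unchanged; only the first visible token arrives sooner.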

What this is NOT

  • Not batch inference (streaming is real-time, incremental)
  • Not faster generation (same speed, earlier visibility)
  • Not video/audio streaming (token streaming for text)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Setting stream=true in API calls to receive tokens as they're generated. The response is a stream of events, each containing one or more tokens, until a final event signals completion.
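
Under the hood this is an SSE stream. Below is a sketch of parsing it by hand with the requests library; the endpoint, model name, and API key are placeholders, and the data:-prefixed lines plus the [DONE] sentinel follow OpenAI's documented wire format:

    import json
    import requests

    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
        json={
            "model": "gpt-4o-mini",  # illustrative model name
            "messages": [{"role": "user", "content": "Say hello"}],
            "stream": True,  # ask the server for SSE events
        },
        stream=True,  # keep the HTTP body unbuffered on the client side
    )

    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip blank event separators
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":  # final event signals completion
            break
        event = json.loads(payload)
        delta = event["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)

The official SDKs wrap this parsing for you; the raw loop is shown only to make the event structure visible.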

Sources: OpenAI Streaming documentation, Anthropic Streaming documentation, SSE specification

Examples

  • stream=true in OpenAI API calls
  • ChatGPT UI showing text appearing word by word
  • SSE events: data: {"choices":[{"delta":{"content":"Hello"}}]}
  • Handling streaming with async generators in Python (see the sketch after this list)
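
The async-generator pattern from the last bullet might look like this. This is a sketch assuming the official openai package and an API key in the environment; stream_reply is a hypothetical helper name:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def stream_reply(prompt: str):
        """Hypothetical helper: yield text deltas as the model produces them."""
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    async def main():
        async for piece in stream_reply("Tell me a short joke."):
            print(piece, end="", flush=True)

    asyncio.run(main())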

Counterexamples

Things that might seem like Streaming but are not:

  • Waiting for the complete response and then displaying it all at once (non-streaming; contrasted in the sketch after this list)
  • Batch processing 1000 prompts (no real-time requirement)
  • Video streaming (different domain)
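
For contrast, the first counterexample in code: the same request without the stream flag blocks until generation finishes (same placeholder model and API-key assumptions as the sketches above):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # No stream flag: nothing is visible until the entire response is ready.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
    )
    print(resp.choices[0].message.content)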

Relations

Implementations

Tools and frameworks that implement this concept: