Streaming

Also known as: Token Streaming, Server-Sent Events, SSE

Definition

Delivering LLM output incrementally as tokens are generated, rather than waiting for the complete response. Streaming improves perceived latency: users see text appear progressively instead of staring at a blank screen during a potentially long generation. It is typically implemented over Server-Sent Events (SSE) or WebSockets.
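
To make the perceived-latency difference concrete, here is a minimal sketch of consuming a stream with the OpenAI Python SDK. The model name is illustrative, and an OPENAI_API_KEY environment variable is assumed:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # With stream=True the call returns an iterator of chunks instead of
    # blocking until the full completion is ready.
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # text appears progressively

Note that total generation time is unchanged; only the first visible token arrives sooner.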

What this is NOT

  • Not batch inference (streaming is real-time, incremental)
  • Not faster generation (same speed, earlier visibility)
  • Not video/audio streaming (token streaming for text)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Setting stream=true in API calls to receive tokens as they're generated. The response is a stream of events, each containing one or more tokens, until a final event signals completion.
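
Under the hood this is an SSE stream. Below is a sketch of parsing it by hand with the requests library; the endpoint, model name, and API key are placeholders, and the data:-prefixed lines plus the [DONE] sentinel follow OpenAI's documented wire format:

    import json
    import requests

    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
        json={
            "model": "gpt-4o-mini",  # illustrative model name
            "messages": [{"role": "user", "content": "Say hello"}],
            "stream": True,  # ask the server for SSE events
        },
        stream=True,  # keep the HTTP body unbuffered on the client side
    )

    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip blank event separators
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":  # final event signals completion
            break
        event = json.loads(payload)
        delta = event["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)

The official SDKs wrap this parsing for you; the raw loop is shown only to make the event structure visible.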

Sources: OpenAI Streaming documentation, Anthropic Streaming documentation, SSE specification

Examples

  • stream=true in OpenAI API calls
  • ChatGPT UI showing text appearing word by word
  • SSE events: data: {"choices":[{"delta":{"content":"Hello"}}]}
  • Handling streaming with async generators in Python (see the sketch after this list)
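
The async-generator pattern from the last bullet might look like this. This is a sketch assuming the official openai package and an API key in the environment; stream_reply is a hypothetical helper name:

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def stream_reply(prompt: str):
        """Hypothetical helper: yield text deltas as the model produces them."""
        stream = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield delta

    async def main():
        async for piece in stream_reply("Tell me a short joke."):
            print(piece, end="", flush=True)

    asyncio.run(main())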

Counterexamples

Things that might seem like Streaming but are not:

  • Waiting for the complete response and then displaying it all at once (non-streaming; contrasted in the sketch after this list)
  • Batch processing 1000 prompts (no real-time requirement)
  • Video streaming (different domain)
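
For contrast, the first counterexample in code: the same request without the stream flag blocks until generation finishes (same placeholder model and API-key assumptions as the sketches above):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # No stream flag: nothing is visible until the entire response is ready.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": "Explain streaming in one sentence."}],
    )
    print(resp.choices[0].message.content)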

Relations

Implementations

Tools and frameworks that implement this concept: