Text Generation Inference

tool · active · open-source

Hugging Face's production-ready inference server for LLMs. TGI provides high-performance serving through continuous batching, tensor parallelism, and quantization support, and it powers Hugging Face's Inference Endpoints.
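
A minimal client sketch against the REST surface, assuming a TGI server already running on localhost:8080 (the /generate endpoint and the inputs/parameters payload shape follow TGI's documented API; prompt and port are placeholders):

```python
import requests

TGI_URL = "http://localhost:8080"  # assumed local TGI server

# /generate is TGI's synchronous text-generation endpoint.
response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What does continuous batching do?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()

# The completion is returned under "generated_text".
print(response.json()["generated_text"])
```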

Implements

Concepts this tool claims to implement:

  • Serving

    Purpose-built for serving LLMs in production. Used by Hugging Face for their Inference Endpoints product.

  • Inference (primary)

    Optimized inference with FlashAttention, continuous batching, and PagedAttention. Supports quantization (GPTQ, AWQ, bitsandbytes); the launch sketch under Integration Surfaces shows the corresponding flag.

  • Streaming (secondary)

    Server-Sent Events (SSE) for token streaming; gRPC for high-performance internal communication. A client-side sketch follows this list.
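
As referenced in the Streaming item above, the sketch below consumes TGI's token stream. It assumes a server on localhost:8080 and the documented /generate_stream endpoint, which emits one SSE data: line per generated token:

```python
import json
import requests

TGI_URL = "http://localhost:8080"  # assumed local TGI server

# /generate_stream returns Server-Sent Events, one "data:" line per token.
with requests.post(
    f"{TGI_URL}/generate_stream",
    json={
        "inputs": "Explain continuous batching in one sentence.",
        "parameters": {"max_new_tokens": 64},
    },
    stream=True,
    timeout=60,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line.startswith(b"data:"):
            continue  # skip blank keep-alive lines between events
        event = json.loads(line[len(b"data:"):])
        # Each event carries the newly generated token; special tokens
        # (e.g. EOS) are flagged so they can be filtered from display.
        if not event["token"]["special"]:
            print(event["token"]["text"], end="", flush=True)
```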

Integration Surfaces

  • REST API
  • gRPC API
  • Docker
  • Hugging Face Inference Endpoints
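
A sketch of how these surfaces compose: the Docker image launches the server, and huggingface_hub's InferenceClient speaks to its REST API. The model id is illustrative, and --quantize awq assumes an AWQ-quantized checkpoint:

```python
# Launch (shell command shown as a comment; flags are TGI launcher options):
#
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantize awq

from huggingface_hub import InferenceClient

# InferenceClient talks to TGI's REST API when pointed at a base URL;
# the same client also works against Hugging Face Inference Endpoints.
client = InferenceClient("http://localhost:8080")

completion = client.text_generation(
    "Summarize tensor parallelism in two sentences.",
    max_new_tokens=96,
)
print(completion)
```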

Details

  • Vendor: Hugging Face
  • License: Apache-2.0
  • Runs On: local, cloud
  • Used By: system

Notes

TGI is Hugging Face's answer to vLLM. It integrates tightly with the Hugging Face ecosystem and serves as the backend for the company's managed inference service. It is particularly strong for PEFT/LoRA adapter serving; a sketch follows.
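
A sketch of that adapter flow, assuming TGI's multi-LoRA support (adapters preloaded at launch with --lora-adapters, then selected per request via an adapter_id parameter); the model and adapter names are placeholders:

```python
# Assumed launch with adapters preloaded (shell command as a comment):
#
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id mistralai/Mistral-7B-v0.1 \
#       --lora-adapters my-org/sql-adapter,my-org/summarize-adapter

import requests

# adapter_id in the request parameters selects which preloaded LoRA
# adapter handles this generation (placeholder adapter name below).
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Translate to SQL: count users created this week.",
        "parameters": {"max_new_tokens": 64, "adapter_id": "my-org/sql-adapter"},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```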