Text Generation Inference
Hugging Face's production-ready inference server for large language models (LLMs). TGI provides high-performance serving with continuous batching, tensor parallelism, and quantization support, and it powers Hugging Face's Inference Endpoints.
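As a rough illustration of the serving surface, a minimal sketch of querying TGI's documented `/generate` REST route; it assumes a TGI instance is already running and reachable at `http://localhost:8080` (the URL and port are assumptions about your deployment, not defaults guaranteed by TGI):

```python
# Minimal sketch: call a running TGI server's /generate endpoint.
# Assumes the server was started separately (e.g. from the official image)
# and is reachable at http://localhost:8080 -- adjust for your setup.
import requests

TGI_URL = "http://localhost:8080"  # assumed local deployment

payload = {
    "inputs": "Explain continuous batching in one sentence.",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
    },
}

resp = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```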
Implements
Concepts this tool claims to implement:
- Model Serving (primary)
Purpose-built for serving LLMs in production. Used by Hugging Face for their Inference Endpoints product.
- Inference (primary)
Optimized inference with Flash Attention, continuous batching, and Paged Attention. Supports quantization (GPTQ, AWQ, bitsandbytes).
- Streaming (secondary)
Server-Sent Events (SSE) for streaming tokens to clients; gRPC for high-performance communication between the router and the model shards (see the streaming sketch after this list).
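A minimal consumer sketch for the SSE stream, assuming the same local deployment as above; TGI's `/generate_stream` route emits `data:` events whose JSON bodies carry one token each (field names follow TGI's streaming response schema, but verify them against the version you run):

```python
# Sketch: stream tokens from TGI's /generate_stream SSE endpoint.
# Each event line looks like `data: {...}`; the JSON body contains a
# `token` object, and the final event also carries `generated_text`.
import json
import requests

TGI_URL = "http://localhost:8080"  # assumed local deployment

payload = {
    "inputs": "Write a haiku about GPUs.",
    "parameters": {"max_new_tokens": 48},
}

with requests.post(
    f"{TGI_URL}/generate_stream", json=payload, stream=True, timeout=60
) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data:"):
            continue  # skip blank separators and keep-alive comments
        event = json.loads(raw[len(b"data:"):])
        print(event["token"]["text"], end="", flush=True)
print()
```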
Details
- Vendor: Hugging Face
- License: Apache-2.0
- Runs On: local, cloud
- Used By: system
Notes
TGI is Hugging Face's answer to vLLM. It integrates tightly with the Hugging Face ecosystem and is the backend for the company's managed inference service. It is particularly strong for serving PEFT/LoRA adapters.
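On the adapter-serving point, a hedged sketch: recent TGI versions can preload multiple LoRA adapters at launch and let each request select one. The `adapter_id` parameter name below is taken from TGI's multi-LoRA documentation, and the adapter repo id is hypothetical; verify both against the version you deploy.

```python
# Sketch: route a request to a specific LoRA adapter on a multi-adapter TGI
# deployment. Assumes the adapter was preloaded when the server was launched;
# `adapter_id` is the request-time selector (assumed parameter name).
import requests

TGI_URL = "http://localhost:8080"          # assumed local deployment
ADAPTER_ID = "my-org/my-lora-adapter"      # hypothetical adapter repo id

payload = {
    "inputs": "Summarize the quarterly report.",
    "parameters": {
        "max_new_tokens": 128,
        "adapter_id": ADAPTER_ID,  # select the LoRA adapter for this request
    },
}

resp = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```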