Text Generation Inference

tool · active · open-source

Hugging Face's production-ready inference server for LLMs. TGI provides high-performance serving through continuous batching, tensor parallelism, and quantization support, and it powers Hugging Face's Inference Endpoints.
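
A minimal client sketch against the REST surface, assuming a TGI server already running on localhost:8080 (the /generate endpoint and the inputs/parameters payload shape follow TGI's documented API; prompt and port are placeholders):

```python
import requests

TGI_URL = "http://localhost:8080"  # assumed local TGI server

# /generate is TGI's synchronous text-generation endpoint.
response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What does continuous batching do?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()

# The completion is returned under "generated_text".
print(response.json()["generated_text"])
```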

Implements

Concepts this tool claims to implement:

  • Serving

    Purpose-built for serving LLMs in production. Used by Hugging Face for their Inference Endpoints product.

  • Inference (primary)

    Optimized inference with FlashAttention, continuous batching, and PagedAttention. Supports quantization (GPTQ, AWQ, bitsandbytes); the launch sketch under Integration Surfaces shows the corresponding flag.

  • Streaming (secondary)

    Server-Sent Events (SSE) for token streaming; gRPC for high-performance internal communication. A client-side sketch follows this list.
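
As referenced in the Streaming item above, the sketch below consumes TGI's token stream. It assumes a server on localhost:8080 and the documented /generate_stream endpoint, which emits one SSE data: line per generated token:

```python
import json
import requests

TGI_URL = "http://localhost:8080"  # assumed local TGI server

# /generate_stream returns Server-Sent Events, one "data:" line per token.
with requests.post(
    f"{TGI_URL}/generate_stream",
    json={
        "inputs": "Explain continuous batching in one sentence.",
        "parameters": {"max_new_tokens": 64},
    },
    stream=True,
    timeout=60,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line.startswith(b"data:"):
            continue  # skip blank keep-alive lines between events
        event = json.loads(line[len(b"data:"):])
        # Each event carries the newly generated token; special tokens
        # (e.g. EOS) are flagged so they can be filtered from display.
        if not event["token"]["special"]:
            print(event["token"]["text"], end="", flush=True)
```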

Integration Surfaces

  • REST API
  • gRPC API
  • Docker
  • Hugging Face Inference Endpoints
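
A sketch of how these surfaces compose: the Docker image launches the server, and huggingface_hub's InferenceClient speaks to its REST API. The model id is illustrative, and --quantize awq assumes an AWQ-quantized checkpoint:

```python
# Launch (shell command shown as a comment; flags are TGI launcher options):
#
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantize awq

from huggingface_hub import InferenceClient

# InferenceClient talks to TGI's REST API when pointed at a base URL;
# the same client also works against Hugging Face Inference Endpoints.
client = InferenceClient("http://localhost:8080")

completion = client.text_generation(
    "Summarize tensor parallelism in two sentences.",
    max_new_tokens=96,
)
print(completion)
```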

Details

  • Vendor: Hugging Face
  • License: Apache-2.0
  • Runs On: local, cloud
  • Used By: system

Notes

TGI is Hugging Face's answer to vLLM. It integrates tightly with the Hugging Face ecosystem and serves as the backend for the company's managed inference service. It is particularly strong for PEFT/LoRA adapter serving; a sketch follows.
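
A sketch of that adapter flow, assuming TGI's multi-LoRA support (adapters preloaded at launch with --lora-adapters, then selected per request via an adapter_id parameter); the model and adapter names are placeholders:

```python
# Assumed launch with adapters preloaded (shell command as a comment):
#
#   docker run --gpus all --shm-size 1g -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id mistralai/Mistral-7B-v0.1 \
#       --lora-adapters my-org/sql-adapter,my-org/summarize-adapter

import requests

# adapter_id in the request parameters selects which preloaded LoRA
# adapter handles this generation (placeholder adapter name below).
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Translate to SQL: count users created this week.",
        "parameters": {"max_new_tokens": 64, "adapter_id": "my-org/sql-adapter"},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```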