
Text Generation Inference (TGI)

by Hugging Face

Hugging Face's optimized inference server for deploying LLMs with continuous batching and flash attention

Open Source · API
Open source (Apache 2.0 / HFOIL), free to use

About Text Generation Inference (TGI)

Text Generation Inference (TGI) by Hugging Face is a production-ready inference server for large language models. It implements continuous batching, flash attention, PagedAttention, and tensor parallelism for high-throughput serving. TGI supports quantization (GPTQ, AWQ, bitsandbytes) and LoRA adapters, and exposes an OpenAI-compatible API.
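A typical deployment runs TGI via its official Docker image. The sketch below is illustrative, not a definitive recipe: the model id, port mapping, and quantization choice are assumptions, while the `--model-id`, `--quantize`, and `--num-shard` flags come from TGI's launcher.

```shell
# Hedged sketch: launching a TGI server with the official Docker image.
# Model id, port, and quantization are illustrative assumptions;
# set --num-shard to the number of GPUs to use for tensor parallelism.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --quantize awq \
  --num-shard 1
```

The `-v` mount caches downloaded model weights across container restarts.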

Key Features

  • Continuous batching
  • Flash attention
  • Tensor parallelism
  • Quantization support
  • LoRA adapters
  • OpenAI-compatible API
  • Token streaming
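The OpenAI-compatible API and token streaming listed above can be sketched as request payloads. Endpoint paths and field names follow TGI's documented HTTP API; the host, port, and prompt text are assumptions for illustration.

```python
import json

# Assumed local TGI deployment; host and port are illustrative.
TGI_URL = "http://localhost:8080"

# Native endpoint: POST {TGI_URL}/generate
# (use /generate_stream for server-sent-event token streaming)
native_payload = {
    "inputs": "What is continuous batching?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

# OpenAI-compatible endpoint: POST {TGI_URL}/v1/chat/completions
openai_payload = {
    "model": "tgi",  # accepted for compatibility; a TGI instance serves one model
    "messages": [{"role": "user", "content": "What is continuous batching?"}],
    "max_tokens": 128,
    "stream": True,  # stream tokens back as they are generated
}

print(json.dumps(openai_payload, indent=2))
```

Because the chat endpoint mirrors OpenAI's schema, existing OpenAI client libraries can usually be pointed at `{TGI_URL}/v1` unchanged.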

Pros

  • Production-tested at scale
  • HuggingFace model compatibility
  • Good quantization support
  • Active development

Cons

  • Complex configuration
  • GPU infrastructure required
  • HFOIL license for some features

Tags

llm-inference · model-serving · open-source · hugging-face · optimization