
TensorRT-LLM

by NVIDIA

NVIDIA's library for optimizing and accelerating LLM inference on NVIDIA GPUs

Open source (Apache 2.0), free to use with NVIDIA GPUs

About TensorRT-LLM

TensorRT-LLM is NVIDIA's open-source library for optimizing and deploying large language models. It provides optimized kernels, INT4/INT8 quantization, and tensor and pipeline parallelism tuned specifically for NVIDIA hardware. Combined with NVIDIA Triton Inference Server, it delivers high-throughput, low-latency inference on NVIDIA GPUs.

Key Features

  • NVIDIA GPU optimization
  • INT4/INT8 quantization
  • Tensor parallelism
  • Pipeline parallelism
  • KV cache management
  • Custom plugins
  • Triton integration
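
To illustrate the idea behind the INT8 quantization feature listed above, here is a minimal sketch of symmetric per-tensor quantization in plain Python. This is an illustrative simplification, not TensorRT-LLM's actual implementation (which uses calibration and per-channel scales, among other refinements):

```python
# Symmetric per-tensor INT8 quantization sketch.
# Illustrative only -- not TensorRT-LLM's internal implementation.

def quantize_int8(weights):
    """Map float weights to int8 values plus a single scale factor."""
    amax = max(abs(w) for w in weights)       # largest magnitude in the tensor
    scale = amax / 127.0 if amax else 1.0     # symmetric int8 range is [-127, 127]
    q = [round(w / scale) for w in weights]   # quantized integer values
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
```

Storing `q` as int8 halves memory versus FP16 (and quarters it versus FP32), at the cost of small rounding error, which is why quantization is a core lever for fitting larger models on a GPU.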

Pros

  • Best performance on NVIDIA hardware
  • Deep optimization
  • Official NVIDIA support
  • Wide model support

Cons

  • NVIDIA GPUs only
  • Complex setup
  • Steep learning curve

Tags

llm-inference, nvidia, gpu-optimization, open-source, high-performance