TensorRT-LLM
NVIDIA's library for optimizing and accelerating LLM inference on NVIDIA GPUs
Open source (Apache 2.0), free to use with NVIDIA GPUs
About TensorRT-LLM
TensorRT-LLM is NVIDIA's open-source library for optimizing and deploying large language models. It provides kernel-level optimizations, quantization, tensor parallelism, and pipeline parallelism tuned specifically for NVIDIA hardware. Combined with NVIDIA Triton Inference Server, it delivers maximum inference performance on NVIDIA GPUs.
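To give a feel for the developer workflow, here is a minimal sketch using the high-level Python `LLM` API that ships with recent TensorRT-LLM releases; the model id is only an example, and argument names can shift between versions.

```python
# Minimal single-GPU generation sketch with TensorRT-LLM's Python LLM API.
# Assumptions: a recent tensorrt-llm release and an example Hugging Face
# model id; verify details against your installed version's documentation.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example checkpoint

# Standard sampling controls for the decode loop.
params = SamplingParams(max_tokens=32, temperature=0.7, top_p=0.95)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

Instantiating `LLM` is where the hardware-specific work happens: the checkpoint is compiled into a TensorRT engine tuned to the local GPU, and subsequent `generate` calls run against that compiled engine.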
Key Features
- NVIDIA GPU optimization
- INT4/INT8 quantization
- Tensor parallelism (see the sketch after this list)
- Pipeline parallelism
- KV cache management
- Custom plugins
- Triton integration
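The parallelism features surface through the same Python API. A minimal sketch, assuming two visible NVIDIA GPUs, a recent release, and an example model id (the `tensor_parallel_size` argument exists in current releases, but check your version's docs):

```python
# Multi-GPU sketch: one model instance sharded across two GPUs.
# Assumptions: recent tensorrt-llm release, 2 visible GPUs, example model id.
from tensorrt_llm import LLM, SamplingParams

# tensor_parallel_size=2 splits each layer's weight matrices across two GPUs,
# so a single model instance spans both devices instead of being replicated.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example checkpoint
    tensor_parallel_size=2,
)

params = SamplingParams(max_tokens=64, temperature=0.8)
for output in llm.generate(["Summarize what tensor parallelism does."], params):
    print(output.outputs[0].text)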
Pros
- Best performance on NVIDIA hardware
- Deep optimization
- Official NVIDIA support
- Wide model support
Cons
- NVIDIA GPUs only
- Complex setup
- Steep learning curve
Tags
llm-inference, nvidia, gpu-optimization, open-source, high-performance
Alternatives to TensorRT-LLM
- vLLM: High-throughput and memory-efficient inference engine for LLMs with PagedAttention technology
- Text Generation Inference (TGI): Hugging Face's optimized inference server for deploying LLMs with continuous batching and flash attention
- llama.cpp: Efficient C/C++ implementation for running LLMs on consumer hardware with quantization support
More Developer Infrastructure Tools
- Hugging Face: The leading open-source platform for sharing, discovering, and deploying ML models, datasets, and Spaces
- LangChain: Open-source framework for building LLM-powered applications with chains, agents, and retrieval-augmented generation
- Pinecone: Managed vector database for building high-performance AI applications with similarity search at scale
- Replicate: Run and deploy open-source ML models in the cloud with a simple API, no infrastructure needed
- Weights & Biases (W&B): ML experiment tracking, model versioning, and dataset management platform for AI teams
- Weaviate: Open-source vector database with built-in vectorization modules and hybrid search capabilities