TensorRT, developed by NVIDIA since 2017, is a deeply hardware-tuned inference library and compiler for NVIDIA GPUs. It takes a trained model and aggressively accelerates it through layer and kernel fusion, quantization (FP16, INT8, FP8) and calibration. The LLM-specific variant, TensorRT-LLM, layers in KV-cache optimisations, continuous batching, speculative decoding and custom kernels to target best-in-class inference performance on H100/H200 GPUs. It is less flexible than open alternatives like vLLM, but it sits among the closest approaches to the physical limits of NVIDIA hardware.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2017
TensorRT
NVIDIA's hardware-tuned high-performance inference library and compiler.
- EN — TensorRT
- TR — TensorRT
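As a minimal sketch of the typical workflow, a trained model exported to ONNX can be compiled into a TensorRT engine with NVIDIA's `trtexec` tool. The file names (`model.onnx`, `model.plan`) are placeholders, and the commands assume a machine with TensorRT and a supported NVIDIA GPU installed.

```shell
# Compile an ONNX model into a serialized TensorRT engine,
# allowing FP16 kernels where they beat FP32.
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Reload the prebuilt engine and benchmark inference latency/throughput.
trtexec --loadEngine=model.plan
```

The build step is where the fusion, precision selection and calibration described above happen; the resulting `.plan` engine is specific to the GPU it was built on.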