TensorRT, developed by NVIDIA since 2017, is a deeply hardware-tuned inference library and compiler for NVIDIA GPUs. It takes a trained model and aggressively accelerates it through layer and kernel fusion, quantization (FP16, INT8, FP8) and calibration. The LLM-specific variant, TensorRT-LLM, layers in KV-cache optimisations, continuous batching, speculative decoding and custom kernels to target best-in-class inference performance on H100/H200 GPUs. It is less flexible than open alternatives like vLLM, but it sits among the closest approaches to the physical limits of NVIDIA hardware.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2017
TensorRT
NVIDIA's hardware-tuned high-performance inference library and compiler.
- EN — TensorRT
- TR — TensorRT
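As a minimal sketch of the typical workflow, a trained model exported to ONNX can be compiled into a TensorRT engine with NVIDIA's `trtexec` tool. The file names (`model.onnx`, `model.plan`) are placeholders, and the commands assume a machine with TensorRT and a supported NVIDIA GPU installed.

```shell
# Compile an ONNX model into a serialized TensorRT engine,
# allowing FP16 kernels where they beat FP32.
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Reload the prebuilt engine and benchmark inference latency/throughput.
trtexec --loadEngine=model.plan
```

The build step is where the fusion, precision selection and calibration described above happen; the resulting `.plan` engine is specific to the GPU it was built on.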