NVIDIA Triton
Glossary · Advanced · 2019
NVIDIA's open-source inference server designed to serve multiple frameworks and hardware backends.

NVIDIA Triton Inference Server is an open-source inference platform that NVIDIA has maintained since 2019, capable of serving many backends (PyTorch, TensorFlow, ONNX, TensorRT, and others) through a single interface. Production features such as dynamic batching, model ensembles, HTTP/gRPC serving, and A/B model versioning have made it the de facto choice in many large enterprises. It is not LLM-specific, but combined with TensorRT-LLM it forms a high-performance serving stack for modern language models. While it lacks the LLM-specific flexibility of vLLM, it remains the go-to option for teams that need to serve many model types behind a single serving infrastructure.
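As a sketch of what calling a Triton-served model looks like from the client side, the snippet below uses the official tritonclient Python package over HTTP. The model name resnet50, the tensor names input__0 and output__0, and the input shape are illustrative assumptions; in practice they must match the model's config.pbtxt in the server's model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor. The name "input__0", the dtype, and the
# 1x3x224x224 image shape are illustrative assumptions that must match
# the served model's config.pbtxt, not values fixed by Triton itself.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Request the output tensor by name and run inference. On the server,
# Triton may transparently merge concurrent requests like this one
# when dynamic batching is enabled for the model.
infer_output = httpclient.InferRequestedOutput("output__0")
response = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[infer_output],
)
print(response.as_numpy("output__0").shape)
```

Server-side features such as dynamic batching are switched on per model in config.pbtxt (via a dynamic_batching block); the client code above stays the same regardless.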
- English term (EN): NVIDIA Triton
- Turkish term (TR): NVIDIA Triton