NVIDIA Triton
Glossary · Advanced · 2019
NVIDIA's open-source inference server designed to serve multiple frameworks and hardware backends.

NVIDIA Triton Inference Server is an open-source inference platform that NVIDIA has maintained since 2019, capable of serving many backends (PyTorch, TensorFlow, ONNX, TensorRT, and others) through a single interface. Production features such as dynamic batching, model ensembles, HTTP/gRPC serving, and A/B model versioning have made it the de facto choice in many large enterprises. It is not LLM-specific, but combined with TensorRT-LLM it forms a high-performance serving stack for modern language models. While it lacks the LLM-specific flexibility of vLLM, it remains the go-to option for teams that need to serve many model types behind a single serving infrastructure.
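As a sketch of what calling a Triton-served model looks like from the client side, the snippet below uses the official tritonclient Python package over HTTP. The model name resnet50, the tensor names input__0 and output__0, and the input shape are illustrative assumptions; in practice they must match the model's config.pbtxt in the server's model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor. The name "input__0", the dtype, and the
# 1x3x224x224 image shape are illustrative assumptions that must match
# the served model's config.pbtxt, not values fixed by Triton itself.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Request the output tensor by name and run inference. On the server,
# Triton may transparently merge concurrent requests like this one
# when dynamic batching is enabled for the model.
infer_output = httpclient.InferRequestedOutput("output__0")
response = client.infer(
    model_name="resnet50",
    inputs=[infer_input],
    outputs=[infer_output],
)
print(response.as_numpy("output__0").shape)
```

Server-side features such as dynamic batching are switched on per model in config.pbtxt (via a dynamic_batching block); the client code above stays the same regardless.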
- English term (EN): NVIDIA Triton
- Turkish term (TR): NVIDIA Triton