Inference is what happens when a trained model takes a new input and produces an output: the moment an LLM reads a prompt and writes a reply. Training is a one-time (and very expensive) investment, but inference happens over and over for the lifetime of the model, which is where the real operational cost accumulates. Metrics like TTFT (time to first token), TPS (tokens per second), throughput, and latency, and techniques like KV cache management and continuous batching, all exist to make this phase cheaper and faster. Inference stacks like vLLM, NVIDIA Triton, and TensorRT may be less glamorous than training, but they are at least as decisive for the product.
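To make TTFT and TPS concrete, here is a minimal sketch that measures both against an OpenAI-compatible streaming endpoint, such as the one vLLM serves. The base URL, API key, and model name are placeholder assumptions, and counting one streamed chunk as one token is an approximation.

```python
# Measure TTFT (time to first token) and TPS (tokens per second)
# against an OpenAI-compatible streaming server, e.g. one started
# with `vllm serve`. URL, key, and model below are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
tokens = 0

stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Explain KV caching briefly."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        tokens += 1  # approximation: one streamed chunk ~ one token

elapsed = time.perf_counter() - start
if first_token_at is not None and tokens > 1:
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.3f}s")
    print(f"TPS:  {tokens / (elapsed - ttft):.1f} tokens/s after first token")
```

TTFT dominates how responsive a chat UI feels, while TPS determines how quickly the rest of the reply streams in; the two are tuned by different parts of the serving stack.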
Glossary · Beginner · 2018
Inference
The process by which a trained model takes an input and produces an output.
- EN (English term): Inference
- TR (Turkish term): Çıkarım (Inference)