Throughput

Throughput is the total amount of work a system processes per unit of time. In the LLM world it is usually expressed as total Tokens per second or requests per second across the whole server. Per-user TPS measures how fast a single user gets a reply, whereas throughput captures the system's aggregate capacity, and the two metrics often pull in opposite directions: batching more requests together typically raises the aggregate token rate while slowing each individual stream. Techniques like Continuous Batching, PagedAttention and efficient KV Cache management are designed to raise throughput without ballooning per-user Latency. Throughput is one of the headline numbers in MLPerf-style benchmarks because, for fixed hardware, it translates most directly into cost per token served.
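
To make the distinction between aggregate throughput and per-user speed concrete, here is a minimal sketch that computes both from a list of request timings. The `RequestRecord` type, the helper names, and all the numbers are hypothetical and chosen purely for illustration; they are not measurements from any real server or benchmark.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Timing for one completed request (hypothetical structure)."""
    start_s: float        # wall-clock time the request started, in seconds
    end_s: float          # wall-clock time the last token was produced
    output_tokens: int    # number of tokens generated for this request


def aggregate_throughput(records):
    """System-wide tokens per second: all tokens over the full time window."""
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / window_s


def mean_per_user_tps(records):
    """Average tokens per second seen by an individual request."""
    rates = [r.output_tokens / (r.end_s - r.start_s) for r in records]
    return sum(rates) / len(rates)


# Illustrative scenario A: four requests batched together, each stream slowed
# down so that 100 tokens take 5 seconds.
batched = [RequestRecord(0.0, 5.0, 100) for _ in range(4)]

# Illustrative scenario B: the same four requests served one after another,
# each finishing its 100 tokens in 2 seconds.
serial = [RequestRecord(2.0 * i, 2.0 * (i + 1), 100) for i in range(4)]

for name, records in (("batched", batched), ("serial", serial)):
    print(name, aggregate_throughput(records), mean_per_user_tps(records))
```

In this made-up scenario the batched run delivers more tokens per second overall while each user sees a slower stream, which is the tension described above.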