Latency is the time between sending a request and receiving a result: a metric inherited from web and distributed-systems engineering, and one of the most product-critical numbers in any LLM stack. In an LLM context latency isn't a single number: time to first token (TTFT) tracks how quickly output starts, "tail latency" tracks the slowest requests (the worst 1%, i.e. p99), and total completion time measures the end-to-end response. At equal throughput, guaranteeing low tail latency is dramatically more expensive than improving the average, which is why product requirements are usually expressed in percentiles such as "p95 < 2 seconds". Streaming UIs can't hide latency, but they reshape the user's perception of waiting, which is why they have become near-standard in modern LLM products.
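As a rough illustration of these three numbers, the sketch below measures TTFT and total completion time per request and then reports the percentiles a product requirement would target. The `fake_stream` generator and its timings are stand-in assumptions, not a real client API; any iterable that yields tokens would work in its place.

```python
import random
import time

def measure(stream):
    """Measure time-to-first-token (TTFT) and total completion time
    for one request, given any iterable that yields tokens."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
    return ttft, time.perf_counter() - start    # (TTFT, end-to-end)

def percentile(samples, p):
    """Nearest-rank percentile, e.g. percentile(latencies, 95) for p95."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

def fake_stream(n_tokens=50):
    """Hypothetical stand-in for a streaming client; yields tokens
    with simulated per-token jitter."""
    for i in range(n_tokens):
        time.sleep(random.uniform(0.0005, 0.002))
        yield f"tok{i}"

if __name__ == "__main__":
    totals = []
    for _ in range(100):
        ttft, total = measure(fake_stream())
        totals.append(total)
    print(f"p50 = {percentile(totals, 50):.3f}s")
    print(f"p95 = {percentile(totals, 95):.3f}s")  # a typical SLO target
    print(f"p99 = {percentile(totals, 99):.3f}s  (tail latency)")
```

Note that the p95 and p99 figures come from the slowest handful of requests, which is exactly why holding them down costs more capacity than holding down the median.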
Glossary · Beginner · 2000
Latency
The time between issuing a request and receiving a result.
- EN (English term) — Latency
- TR (Turkish term) — Gecikme