#performance

0 blog · 0 news · 6 wiki

Wiki

The period, beginning in 2024, in which inference providers competed on tokens per second (TPS).

A feature that caches large recurring prompt prefixes for major cost and latency savings.

How many tokens a model generates per second — the most visible metric of inference speed.

An inference speedup in which a small draft model proposes several tokens that the larger target model then verifies in parallel.
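The draft-then-verify loop can be illustrated with a minimal sketch. It assumes greedy decoding, where a drafted token is accepted only if it matches the larger model's own greedy choice; production systems typically use a probabilistic acceptance rule and verify all drafted positions in one batched forward pass. The names `Model`, `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical and exist only for this illustration.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: maps a token context
# to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The larger model checks each proposed position. A real system would
    #    do this in a single parallel pass; a loop is used here for clarity.
    accepted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the larger model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the verification pass yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate max_new_tokens tokens using repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models for demonstration: the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong after token 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Under this greedy assumption the output is identical to what the larger model would produce alone; the gain comes from each round emitting at least one token, and up to `k + 1` when the draft is right.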

The total number of tokens, requests, or jobs a system can process per unit of time.

The time between sending a request and receiving the first generated token.
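The three metrics in these entries (time to first token, tokens per second, and throughput) can all be derived from per-token arrival timestamps. The sketch below is one way to do that and rests on stated assumptions: `RequestTrace` and the function names are hypothetical, tokens per second is computed over the decode phase only (excluding the wait for the first token), and throughput counts generated tokens across all requests. Providers and benchmarks differ on these conventions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request; all times are in seconds."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def tokens_per_second(trace: RequestTrace) -> float:
    """Decode rate: tokens after the first, divided by the time they took."""
    n = len(trace.token_times)
    elapsed = trace.token_times[-1] - trace.token_times[0]
    if n < 2 or elapsed <= 0:
        raise ValueError("need at least two tokens with increasing timestamps")
    return (n - 1) / elapsed


def throughput(traces: List[RequestTrace]) -> float:
    """Generated tokens per second across all requests in the window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    if end <= start:
        raise ValueError("measurement window must have positive duration")
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)
```

The distinction matters when requests run concurrently: a system can have modest per-request tokens per second yet high throughput, because throughput sums over every request served in the same window.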