#metrics

0 blog · 0 news · 7 wiki

§03

Wiki

How many tokens a model generates per second — the most visible metric of inference speed.

The slow first response when a model or service has been idle and must initialise on demand.

The time between issuing a request and receiving a result.

The number of tokens, requests or jobs a system can process per unit of time.

The time between sending a request and receiving the first generated token.

The fraction of the hardware's theoretical peak FLOPS that a real training run actually delivers; a key efficiency metric.

Floating-point operations per second — the classic metric for raw compute power.
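
As a rough illustration of how the metrics above relate to one another, the sketch below derives them from per-request timestamps. It is a minimal example under stated assumptions, not a reference implementation: the `RequestTiming` record and all function names are hypothetical, tokens per second is computed end to end (other conventions exclude the time before the first token), and the utilisation estimate uses the common approximation of about 6 FLOPs per parameter per training token for dense transformers, which ignores attention cost.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    """Hypothetical record of one request; all times are in seconds."""
    sent_at: float          # when the request was issued
    first_token_at: float   # when the first generated token arrived
    finished_at: float      # when the full result arrived
    tokens_generated: int   # number of tokens in the response


def time_to_first_token(t: RequestTiming) -> float:
    # Time between sending the request and receiving the first token.
    return t.first_token_at - t.sent_at


def latency(t: RequestTiming) -> float:
    # Time between issuing the request and receiving the result.
    return t.finished_at - t.sent_at


def tokens_per_second(t: RequestTiming) -> float:
    # Assumption: end-to-end rate, i.e. tokens divided by total latency.
    return t.tokens_generated / latency(t)


def throughput(timings: Sequence[RequestTiming]) -> float:
    # Tokens processed per unit of wall-clock time across all requests,
    # measured from the earliest send to the latest finish.
    window = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    return sum(t.tokens_generated for t in timings) / window


def flops_utilisation(n_params: float, train_tokens_per_s: float,
                      peak_flops_per_s: float) -> float:
    # Assumption: about 6 FLOPs per parameter per training token
    # (forward plus backward pass, attention cost ignored).
    achieved_flops_per_s = 6.0 * n_params * train_tokens_per_s
    return achieved_flops_per_s / peak_flops_per_s


if __name__ == "__main__":
    a = RequestTiming(sent_at=0.0, first_token_at=0.25, finished_at=2.0, tokens_generated=100)
    b = RequestTiming(sent_at=1.0, first_token_at=1.5, finished_at=3.0, tokens_generated=100)
    print("TTFT (s):", time_to_first_token(a))
    print("Latency (s):", latency(a))
    print("Tokens/s:", tokens_per_second(a))
    print("Throughput (tokens/s):", throughput([a, b]))
    print("Utilisation:", flops_utilisation(1e9, 10_000, 1e14))
```

Note that the two overlapping requests give a system throughput higher than either request's own tokens-per-second figure, which is why per-request speed and aggregate throughput are tracked as separate metrics.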