MEVZU N° TAG / VOL. 147
#serving
Wiki
PagedAttention
A technique that manages the KV cache like virtual-memory pages, allocating fixed-size blocks on demand to nearly eliminate memory waste and fragmentation.
- EN: PagedAttention
- TR: PagedAttention
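A minimal sketch of the PagedAttention idea, assuming a fixed block size (illustrative only, not vLLM's actual implementation): each sequence keeps a block table mapping logical token positions to fixed-size physical blocks, so memory is allocated on demand and freed blocks go straight back to a shared pool.

```python
# Toy PagedAttention-style KV block management (illustrative sketch).
BLOCK_SIZE = 16  # tokens per physical block (assumed)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()  # raises IndexError when memory is exhausted

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class SequenceKV:
    """Per-sequence block table: logical positions -> physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the last one is full, so at
        # most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def free_all(self) -> None:
        # Finished sequences return their blocks to the pool immediately.
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()
```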
Cold Start
The slow first response when a model or service has been idle and must initialise on demand.
- EN: Cold Start
- TR: Soğuk Başlatma
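A common mitigation, sketched below with a stand-in model class (all names hypothetical): load once behind a lazy singleton and fire a warm-up call at startup, so the first real request doesn't pay the initialisation cost.

```python
import time

class FakeModel:
    """Stand-in for a real model; constructing it is the expensive part."""
    def __init__(self):
        time.sleep(2.0)  # simulate weight loading / GPU init

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

_model = None

def get_model() -> FakeModel:
    """Lazy singleton: only the first caller pays the cold-start cost."""
    global _model
    if _model is None:
        t0 = time.perf_counter()
        _model = FakeModel()
        print(f"cold start took {time.perf_counter() - t0:.1f}s")
    return _model

def warm_up() -> None:
    """Call at process start so the first real request hits a warm model."""
    get_model().generate("ping")
```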
vLLM
An open-source inference framework that delivers high-throughput LLM serving via PagedAttention.
- EN: vLLM
- TR: vLLM
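A short usage sketch based on vLLM's documented offline-inference API; the model name is just an example.

```python
from vllm import LLM, SamplingParams

# Load a model once; vLLM manages the KV cache with PagedAttention
# and batches concurrent requests automatically.
llm = LLM(model="facebook/opt-125m")  # example model; swap in your own

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is paging?", "Explain batching."], params)

for out in outputs:
    print(out.outputs[0].text)
```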
llama.cpp
Georgi Gerganov's open-source C++ project that made running LLMs locally a practical reality.
- EN: llama.cpp
- TR: llama.cpp
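llama.cpp itself is a C/C++ binary, but to keep all examples in one language, here is a sketch via the separate llama-cpp-python bindings; the model path is a placeholder.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a quantized GGUF model from disk and run it fully locally,
# no server or GPU required.
llm = Llama(model_path="./models/llama-3-8b-q4.gguf")  # placeholder path

out = llm("Q: What is llama.cpp? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```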
NVIDIA Triton
NVIDIA's open-source inference server designed to serve multiple frameworks and hardware backends.
- EN: NVIDIA Triton
- TR: NVIDIA Triton
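A client-side sketch using the tritonclient Python package; the model name and tensor names ("my_model", "INPUT0", "OUTPUT0") are placeholders that would come from the model's server-side configuration.

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a running Triton server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request tensor; names and shapes must match the model's
# config.pbtxt on the server.
data = np.zeros((1, 4), dtype=np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer("my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```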
Streaming Output
Sending the model's response token-by-token in real time rather than waiting for the complete answer.
- EN: Streaming Output
- TR: Akış Çıktısı
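The core pattern, reduced to a toy: the producer yields tokens as they are decoded, and the consumer renders them immediately instead of waiting for a complete string.

```python
import time

def generate_tokens(prompt: str):
    """Stand-in for a model's decode loop: yields one token at a time."""
    for token in ["Stream", "ing ", "feels ", "much ", "faster."]:
        time.sleep(0.2)  # simulate per-token decode latency
        yield token

# The caller prints tokens as they arrive rather than at the end.
for tok in generate_tokens("hello"):
    print(tok, end="", flush=True)
print()
```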
Continuous Batching
A dynamic serving technique where new requests can join an in-flight batch and finished ones leave immediately.
- EN: Continuous Batching
- TR: Sürekli Yığınlama
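A toy scheduler loop showing the idea (deliberately far simpler than a real engine): every step decodes one token for each active sequence, finished sequences are evicted at once, and queued requests immediately fill the freed slots.

```python
import random
from collections import deque

MAX_BATCH = 4

waiting = deque(f"req{i}" for i in range(8))  # queued requests
active = {}  # request id -> tokens generated so far

def step() -> None:
    """One decode iteration over the whole in-flight batch."""
    # Admit new requests into any free slots; unlike static batching,
    # nobody waits for the current batch to fully drain.
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = 0

    # Decode one token per active sequence; evict finished ones at once.
    for req in list(active):
        active[req] += 1
        if random.random() < 0.2:  # stand-in for hitting an EOS token
            print(f"{req} done after {active[req]} tokens")
            del active[req]

while waiting or active:
    step()
```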
Ollama
A tool that makes downloading and running LLMs on your own machine as simple as a single command.
- EN: Ollama
- TR: Ollama
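Once the daemon is running and a model has been pulled (e.g. with `ollama pull`), its local REST API can be called from any language; a minimal Python sketch, assuming Ollama's default port and an example model name.

```python
import requests

# Assumes `ollama serve` is running locally and the model has been
# pulled; 11434 is Ollama's default port.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```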
KV Cache
The cache that stores previously computed key/value vectors so the model doesn't recompute them every step.
- EN: KV Cache
- TR: KV Önbelleği
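A toy illustration of the mechanics with NumPy: each decode step computes key/value vectors for the new token only and appends them to the cache, then attention reads the whole cached history.

```python
import numpy as np

d = 8  # head dimension (toy size)
rng = np.random.default_rng(0)

# The cache grows by one row per generated token.
k_cache = np.empty((0, d), dtype=np.float32)
v_cache = np.empty((0, d), dtype=np.float32)

for step in range(5):
    # Compute K/V for the *new* token only; earlier steps are never redone.
    k_new, v_new = rng.standard_normal((2, 1, d), dtype=np.float32)
    k_cache = np.vstack([k_cache, k_new])
    v_cache = np.vstack([v_cache, v_new])

    # Attention for the new token's query over the whole cached history.
    q = rng.standard_normal((d,), dtype=np.float32)
    scores = k_cache @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache
    print(f"step {step}: cache length = {len(k_cache)}")
```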