PagedAttention is the core innovation behind vLLM, introduced by Kwon et al. in 2023. Conventional LLM serving allocates the KV cache as one large contiguous buffer per request, sized for the maximum possible sequence length, so much of it is never used; the vLLM authors report that 60-80% of KV cache memory is wasted this way in existing systems. PagedAttention borrows the idea of paged virtual memory from operating systems: it splits the KV cache into small fixed-size blocks that are allocated on demand, tracked through a per-request block table, and can be shared and packed dynamically across requests. The payoff is that far more concurrent requests fit on the same GPU, which is one of the main reasons modern LLM inference can be served economically at all.
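To make the paging analogy concrete, below is a minimal sketch in Python of a block-table allocator. It is illustrative only, not vLLM's actual API: the class name PagedKVCacheManager, the block size of 16 tokens, and the method names are all assumptions made for this example.

```python
# Minimal sketch of the paging idea: KV cache memory is carved into fixed-size
# blocks, each request keeps a block table mapping its token positions to
# physical blocks, and freed blocks return to a shared pool for reuse.
# (Hypothetical names and values; not vLLM's real implementation.)

BLOCK_SIZE = 16  # tokens per block (illustrative value)


class PagedKVCacheManager:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # request id -> list of block ids

    def append_token(self, request_id, num_tokens_so_far):
        """Return the physical block for the new token, allocating one if needed."""
        table = self.block_tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:     # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks; must preempt or queue")
            table.append(self.free_blocks.pop())
        return table[-1]

    def free(self, request_id):
        """Release all blocks of a finished request back to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))


# Usage: two requests draw from one physical pool instead of each reserving
# a large contiguous slab up front.
mgr = PagedKVCacheManager(num_blocks=8)
for i in range(20):
    mgr.append_token("req-A", i)   # req-A grows block by block as it decodes
for i in range(5):
    mgr.append_token("req-B", i)   # req-B fits in the remaining blocks
mgr.free("req-A")                  # req-A's blocks become reusable immediately
```

Because blocks are allocated only as tokens are generated, memory waste is limited to the unused tail of each request's last block, rather than the full over-reserved sequence length.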
Glossary · Advanced · 2023
PagedAttention
A technique that manages the KV cache in fixed-size blocks, analogous to virtual memory pages, sharply reducing memory waste and fragmentation.
- EN — English term
- PagedAttention
- TR — Turkish term
- PagedAttention