Continuous batching, introduced by Yu et al.'s 2022 Orca paper and later popularised by vLLM, is an LLM serving technique that removes the rigidity of static batching. Classic static batching forces all requests in a batch to start and finish together, leaving GPU slots idle whenever a short request is stuck behind a long one. With continuous batching the batch composition is updated at every decoding step: completed requests leave immediately and new ones slot into the freed positions. The result is a dramatic throughput gain at the server level while per-user latency stays roughly the same.
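The scheduling idea can be illustrated with a small simulation; this is a hypothetical sketch of the per-step refill logic, not the Orca or vLLM implementation, and the helper names are made up for illustration:

```python
# Compare total decode steps under static vs continuous batching.
# Each request is modelled only by the number of decode steps it needs.
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Static batching: a whole batch must finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The batch stalls on its longest request; short ones sit idle.
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: finished requests are replaced at every step."""
    queue = deque(lengths)
    batch = []  # remaining steps per in-flight request
    steps = 0
    while queue or batch:
        # Refill freed slots before each step -- the key difference.
        while queue and len(batch) < batch_size:
            batch.append(queue.popleft())
        steps += 1
        # Decrement remaining work; finished requests leave immediately.
        batch = [r - 1 for r in batch if r > 1]
    return steps

# Mixed short and long requests (decode lengths), batch size 2.
requests = [3, 100, 4, 5, 100, 2, 3, 4]
print(static_batching_steps(requests, batch_size=2))      # → 209
print(continuous_batching_steps(requests, batch_size=2))  # → 112
```

With the same hardware budget (two slots), the continuous scheduler finishes in roughly half the steps, because short requests no longer wait for a long batch-mate to complete.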
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Advanced · 2022
Continuous Batching
A dynamic serving technique where new requests can join an in-flight batch and finished ones leave immediately.
- EN (English): Continuous Batching
- TR (Turkish): Sürekli Yığınlama