Speculative decoding, introduced by Leviathan et al. in 2023, can dramatically accelerate LLM inference. The idea is simple but powerful: a small, fast 'draft' model proposes several tokens in advance, then the large target model verifies the draft in a single parallel pass and accepts the prefix that matches its own distribution. This partially sidesteps the inherently sequential nature of autoregressive generation and typically yields a 2-3x throughput improvement. It is becoming a standard component of modern inference stacks (vLLM, TensorRT-LLM) and is especially valuable in long-context settings. Because the accept/reject rule provably preserves the target model's output distribution, the speedup comes at no quality cost.
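The sketch below illustrates the accept/reject rule at the heart of the method, using toy categorical distributions in place of real draft and target models; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, q_probs, p_probs):
    """Accept the longest prefix of draft tokens consistent with the
    target model. q_probs[i] and p_probs[i] are the draft and target
    distributions at position i (each of shape vocab_size)."""
    accepted = []
    for i, x in enumerate(draft_tokens):
        q, p = q_probs[i], p_probs[i]
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p - q), renormalized; this correction is what
            # keeps the output distribution identical to the target's.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(p), p=residual))
            break
    return accepted

# Toy example: vocabulary of size 4, draft proposes two tokens.
q_probs = np.array([[0.70, 0.10, 0.10, 0.10],
                    [0.25, 0.25, 0.25, 0.25]])
p_probs = np.array([[0.40, 0.30, 0.20, 0.10],
                    [0.10, 0.60, 0.20, 0.10]])
print(verify_draft([0, 1], q_probs, p_probs))
```

For brevity the sketch omits one step of the full algorithm: when every draft token is accepted, the target model additionally samples one bonus token from its own next-position distribution, so each verification pass always yields at least one new token.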
Glossary · Advanced · 2023
Speculative Decoding
An inference acceleration technique in which a small draft model proposes multiple tokens that the large target model then verifies in parallel.
- EN: Speculative Decoding
- TR: Spekülatif Çözme