Speculative decoding, introduced by Leviathan et al. in 2023, can dramatically accelerate LLM inference. The idea is simple but powerful: a small, fast 'draft' model proposes several tokens in advance, then the large target model verifies the draft in a single parallel pass and accepts the prefix that matches its own distribution. This partially sidesteps the inherently sequential nature of autoregressive generation and typically yields a 2-3x throughput improvement. It is becoming a standard component of modern inference stacks (vLLM, TensorRT-LLM) and is especially valuable in long-context settings. Because the accept/reject rule provably preserves the target model's output distribution, the speedup comes at no quality cost.
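The sketch below illustrates the accept/reject rule at the heart of the method, using toy categorical distributions in place of real draft and target models; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(draft_tokens, q_probs, p_probs):
    """Accept the longest prefix of draft tokens consistent with the
    target model. q_probs[i] and p_probs[i] are the draft and target
    distributions at position i (each of shape vocab_size)."""
    accepted = []
    for i, x in enumerate(draft_tokens):
        q, p = q_probs[i], p_probs[i]
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p - q), renormalized; this correction is what
            # keeps the output distribution identical to the target's.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(p), p=residual))
            break
    return accepted

# Toy example: vocabulary of size 4, draft proposes two tokens.
q_probs = np.array([[0.70, 0.10, 0.10, 0.10],
                    [0.25, 0.25, 0.25, 0.25]])
p_probs = np.array([[0.40, 0.30, 0.20, 0.10],
                    [0.10, 0.60, 0.20, 0.10]])
print(verify_draft([0, 1], q_probs, p_probs))
```

For brevity the sketch omits one step of the full algorithm: when every draft token is accepted, the target model additionally samples one bonus token from its own next-position distribution, so each verification pass always yields at least one new token.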
Glossary · Advanced · 2023
Speculative Decoding
An inference acceleration technique in which a small draft model proposes multiple tokens that the large target model then verifies in parallel.
- EN: Speculative Decoding
- TR: Spekülatif Çözme