Knowledge distillation, formalised by Hinton and colleagues in 2015, is a compression technique in which a smaller 'student' model is trained to mimic the behaviour and 'soft' probability outputs of a larger 'teacher'. Because the student learns not just the hard labels but the teacher's full output distribution, it often performs far better than its size alone would predict. DistilBERT, TinyBERT and more recent model families such as Microsoft's Phi all rely heavily on this idea. Today, synthetic-data generation and post-training pipelines that use a strong teacher to train a smaller student are arguably the most common modern flavour of distillation.
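The core recipe from the 2015 paper can be written as a single loss function. The sketch below, in PyTorch, blends a soft term (KL divergence between teacher and student distributions, both softened by a temperature) with the usual hard-label cross-entropy; the temperature `T` and mixing weight `alpha` are illustrative values, not prescribed ones, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style distillation objective (a minimal sketch)."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In a training loop the teacher runs in inference mode to produce `teacher_logits`, and only the student's parameters are updated with this loss.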
Glossary · Intermediate · 2015
Knowledge Distillation
Training a smaller 'student' model to mimic the behaviour of a larger 'teacher' model.
- EN (English term): Knowledge Distillation
- TR (Turkish term): Bilgi Damıtma