Pruning is the technique of removing weights with negligible impact on a neural network's output to reduce its size and compute cost; Han et al.'s 2015 work is a foundational reference. There are two main flavours: unstructured pruning zeroes out individual weights and produces sparse matrices, while structured pruning removes whole blocks such as neurons or attention heads, which translates more directly into hardware speedups. In practice pruning is less ubiquitous than quantization or distillation, but it remains a valuable tool for specialised hardware and edge deployments. The most aggressive LLM-era pruning experiments appear in research that tries to extract small "capable sub-networks" from much larger models.
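The unstructured flavour can be sketched in a few lines: rank weights by magnitude and zero out the smallest ones. This is a minimal NumPy illustration on a toy matrix, not the API of any particular pruning library; the `magnitude_prune` helper and the 50% sparsity target are chosen here purely for demonstration.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Unstructured magnitude pruning: zero out the smallest-magnitude
    fraction `sparsity` of weights, returning the pruned matrix and mask."""
    k = int(W.size * sparsity)                      # number of weights to drop
    threshold = np.sort(np.abs(W), axis=None)[k]    # magnitude cut-off
    mask = np.abs(W) >= threshold                   # True = keep this weight
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # toy weight matrix

W_pruned, mask = magnitude_prune(W, 0.5)
print(f"fraction zeroed: {1 - mask.mean():.2f}")
```

Structured pruning works the same way in spirit, but the unit of removal is a whole row, column, or head rather than an individual weight, so the resulting matrix stays dense and smaller instead of sparse.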
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2015
Pruning
Removing weights with negligible impact to shrink a model and speed it up.
- EN — Pruning
- TR — Budama (Pruning)