Multi-head attention runs the self-attention or cross-attention computation in parallel through several 'heads', each free to focus on a different kind of relationship. One head might capture syntactic dependencies, another long-range coreference, another a semantic affinity; their outputs are then concatenated and projected before being passed on. It is one of the central innovations of Vaswani et al.'s 2017 Transformer design and is now a default ingredient in nearly every LLM. The trick makes the model far more versatile: it can hold several 'perspectives' at once instead of being forced into a single attention pattern.
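For readers who want to see the mechanics, here is a minimal sketch of the idea in Python with PyTorch. The function name and the weight matrices (w_q, w_k, w_v, w_o) are illustrative rather than taken from the paper or any particular library; the computation is the standard scaled dot-product attention, split across heads and recombined at the end.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product attention computed in parallel across heads.
    x: (batch, seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Reshape so each head attends over its own d_head-sized slice.
    def split(t):
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)
    q, k, v = split(q), split(k), split(v)

    # Each head computes its own attention pattern independently.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)
    per_head = weights @ v  # (batch, num_heads, seq_len, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)
    return concat @ w_o

# Hypothetical usage: 2 sequences of 5 tokens, d_model = 8, 2 heads.
x = torch.randn(2, 5, 8)
w_q, w_k, w_v, w_o = (torch.randn(8, 8) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
print(out.shape)  # torch.Size([2, 5, 8])
```

Note that each head works on a narrower slice of the model dimension, so the total cost stays close to that of a single full-width attention head.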
Glossary · Intermediate · 2017
Multi-Head Attention
A version of attention where multiple parallel 'heads' learn different relationships at the same time.
- EN (English term): Multi-head Attention
- TR (Turkish term): Çok-Başlı Dikkat