Multi-head attention runs the self-attention or cross-attention computation in parallel through several 'heads', each free to focus on a different kind of relationship. One head might capture syntactic dependencies, another long-range coreference, another a semantic affinity; their outputs are then concatenated and projected before being passed on. It is one of the central innovations of Vaswani et al.'s 2017 Transformer design and is now a default ingredient in nearly every LLM. The trick makes the model far more versatile: it can hold several 'perspectives' at once instead of being forced into a single attention pattern.
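For readers who want to see the mechanics, here is a minimal sketch of the idea in Python with PyTorch. The function name and the weight matrices (w_q, w_k, w_v, w_o) are illustrative rather than taken from the paper or any particular library; the computation is the standard scaled dot-product attention, split across heads and recombined at the end.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product attention computed in parallel across heads.
    x: (batch, seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Reshape so each head attends over its own d_head-sized slice.
    def split(t):
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)
    q, k, v = split(q), split(k), split(v)

    # Each head computes its own attention pattern independently.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)
    per_head = weights @ v  # (batch, num_heads, seq_len, d_head)

    # Concatenate the heads and apply the final output projection.
    concat = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)
    return concat @ w_o

# Hypothetical usage: 2 sequences of 5 tokens, d_model = 8, 2 heads.
x = torch.randn(2, 5, 8)
w_q, w_k, w_v, w_o = (torch.randn(8, 8) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=2)
print(out.shape)  # torch.Size([2, 5, 8])
```

Note that each head works on a narrower slice of the model dimension, so the total cost stays close to that of a single full-width attention head.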
Glossary · Intermediate · 2017
Multi-Head Attention
A version of attention where multiple parallel 'heads' learn different relationships at the same time.
- EN (English term): Multi-head Attention
- TR (Turkish term): Çok-Başlı Dikkat