Attention is the mechanism that lets a model learn how much weight to give different parts of its input. Bahdanau et al. introduced it for machine translation in 2014 to relieve the fixed-length encoding bottleneck that hurt long sentences, letting the decoder softly pick which source words mattered at each output step. In 2017, Vaswani and colleagues went a step further in 'Attention Is All You Need', removing recurrence entirely and proposing the Transformer, an architecture built on attention alone. Today, variants such as self-attention, cross-attention, and multi-head attention sit at the heart of virtually every modern LLM.
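To make the "softly pick which parts matter" idea concrete, here is a minimal sketch of scaled dot-product attention, the core computation inside these variants. It uses NumPy, the function name and toy shapes are illustrative, and it omits masking, learned projections, and multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score every query against every key, softmax the scores into
    weights, and return the weighted blend of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (num_queries, num_keys) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: each row sums to 1
    return weights @ V, weights                      # blended values, attention map

# Toy self-attention: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # 4 tokens, 8-dimensional embeddings
out, attn = scaled_dot_product_attention(x, x, x)
print(attn.round(2))                                 # how much each token attends to the others
```

In self-attention the three inputs are projections of the same sequence, as in the toy call above; in cross-attention the queries come from the decoder while keys and values come from the encoder; multi-head attention runs several such computations in parallel on learned subspaces and concatenates the results.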