Cross-attention is the mechanism that lets one sequence attend to information from a different sequence — in the classic Transformer, it is where the decoder looks at the encoder output. Unlike self-attention, where queries, keys, and values all come from the same sequence, here the queries come from one source while the keys and values come from another, so a decoder generating a Turkish translation can consult the encoder's representation of the English input. Vision-language models (VLMs), diffusion image models, and most text-conditioned generation systems rely on cross-attention to ground their output in the conditioning input. It is, in short, where an output sequence learns how to consume an input.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2017
Cross-Attention
An attention mechanism where one sequence attends to a different sequence, typically connecting encoder and decoder.
- EN (English term) — Cross-Attention
- TR (Turkish term) — Çapraz-Dikkat
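
To make the definition concrete, here is a minimal single-head cross-attention sketch in PyTorch. The class name `CrossAttention`, the single-head layout, and all dimensions are illustrative assumptions, not a reference implementation; real Transformers use multi-head attention with an output projection.

```python
# Minimal single-head cross-attention sketch (illustrative, not a
# reference implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Queries are projected from the decoder stream; keys and
        # values are projected from the encoder stream.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, decoder_states, encoder_states):
        # decoder_states: (batch, tgt_len, d_model)
        # encoder_states: (batch, src_len, d_model)
        q = self.q_proj(decoder_states)
        k = self.k_proj(encoder_states)
        v = self.v_proj(encoder_states)
        # Scaled dot-product attention: each target position scores
        # every source position, then mixes the source values.
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # (batch, tgt_len, d_model)

# Usage: a decoder generating Turkish attends over English encodings.
dec = torch.randn(1, 5, 64)   # 5 target tokens generated so far
enc = torch.randn(1, 9, 64)   # 9 source tokens, fully encoded
out = CrossAttention(64)(dec, enc)  # (1, 5, 64)
```

In an encoder-decoder translation model, a block like this sits between each decoder layer's masked self-attention and its feed-forward sublayer; the encoder states are computed once per source sentence and reused at every decoding step.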