Multi-head Attention is a module for attention mechanisms that runs an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allow the model to attend to different parts of the sequence in different ways (e.g. longer-term dependencies versus shorter-term dependencies).
$$ \text{MultiHead}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \left[\text{head}_{1},\dots,\text{head}_{h}\right]\textbf{W}^{O}$$
$$\text{where} \text{ head}_{i} = \text{Attention} \left(\textbf{Q}\textbf{W}_{i}^{Q}, \textbf{K}\textbf{W}_{i}^{K}, \textbf{V}\textbf{W}_{i}^{V} \right) $$
Above, $\textbf{W}^{O}$ and the $\textbf{W}_{i}^{Q}$, $\textbf{W}_{i}^{K}$, $\textbf{W}_{i}^{V}$ are all learnable parameter matrices.
Note that scaled dot-product attention is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.
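Below is a minimal sketch of the module in PyTorch, assuming scaled dot-product attention as the per-head mechanism. The class and parameter names (`MultiHeadAttention`, `d_model`, `num_heads`) are illustrative rather than taken from any particular library; the per-head projections are fused into single linear layers for convenience.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Learnable projections W^Q, W^K, W^V for all heads, fused into one
        # matrix each, plus the output projection W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        batch, seq_len, _ = q.shape

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(q))
        k = split_heads(self.w_k(k))
        v = split_heads(self.w_v(v))

        # Scaled dot-product attention, applied independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads and apply the final linear transform W^O.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(concat)


# Example usage (self-attention: Q = K = V = x)
x = torch.randn(2, 10, 64)                  # batch of 2, sequence length 10, model dim 64
mha = MultiHeadAttention(d_model=64, num_heads=8)
out = mha(x, x, x)
print(out.shape)                            # torch.Size([2, 10, 64])
```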
Source: Lilian Weng
Source: Attention Is All You Need
Task | Papers | Share |
---|---|---|
Language Modelling | 54 | 7.15% |
Retrieval | 37 | 4.90% |
Semantic Segmentation | 29 | 3.84% |
Question Answering | 27 | 3.58% |
Large Language Model | 25 | 3.31% |
Sentence | 15 | 1.99% |
Object Detection | 14 | 1.85% |
Image Segmentation | 13 | 1.72% |
Benchmarking | 12 | 1.59% |
Component | Type |
---|---|
Linear Layer | Feedforward Networks |
Scaled Dot-Product Attention | Attention Mechanisms |
Softmax | Output Functions |