Multi-head Attention is a module for attention mechanisms that runs an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Intuitively, multiple attention heads allow the model to attend to different parts of the sequence in different ways (e.g. longer-term dependencies versus shorter-term dependencies).
$$ \text{MultiHead}\left(\textbf{Q}, \textbf{K}, \textbf{V}\right) = \left[\text{head}_{1},\dots,\text{head}_{h}\right]\textbf{W}^{O}$$
$$\text{where} \text{ head}_{i} = \text{Attention} \left(\textbf{Q}\textbf{W}_{i}^{Q}, \textbf{K}\textbf{W}_{i}^{K}, \textbf{V}\textbf{W}_{i}^{V} \right) $$
Above, $\textbf{W}^{O}$ and the $\textbf{W}_{i}^{Q}$, $\textbf{W}_{i}^{K}$, $\textbf{W}_{i}^{V}$ are all learnable parameter matrices.
Note that scaled dot-product attention is most commonly used in this module, although in principle it can be swapped out for other types of attention mechanism.
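Below is a minimal sketch of the module in PyTorch, assuming scaled dot-product attention as the per-head mechanism. The class and parameter names (`MultiHeadAttention`, `d_model`, `num_heads`) are illustrative rather than taken from any particular library; the per-head projections are fused into single linear layers for convenience.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Learnable projections W^Q, W^K, W^V for all heads, fused into one
        # matrix each, plus the output projection W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        batch, seq_len, _ = q.shape

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(q))
        k = split_heads(self.w_k(k))
        v = split_heads(self.w_v(v))

        # Scaled dot-product attention, applied independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads and apply the final linear transform W^O.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(concat)


# Example usage (self-attention: Q = K = V = x)
x = torch.randn(2, 10, 64)                  # batch of 2, sequence length 10, model dim 64
mha = MultiHeadAttention(d_model=64, num_heads=8)
out = mha(x, x, x)
print(out.shape)                            # torch.Size([2, 10, 64])
```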
Source: Lilian Weng
Source: Attention Is All You Need
Task | Papers | Share |
---|---|---|
Language Modelling | 54 | 7.15% |
Retrieval | 37 | 4.90% |
Semantic Segmentation | 29 | 3.84% |
Question Answering | 27 | 3.58% |
Large Language Model | 25 | 3.31% |
Sentence | 15 | 1.99% |
Object Detection | 14 | 1.85% |
Image Segmentation | 13 | 1.72% |
Benchmarking | 12 | 1.59% |
Component | Type |
---|---|
Linear Layer | Feedforward Networks |
Scaled Dot-Product Attention | Attention Mechanisms |
Softmax | Output Functions |