Scaled dot-product attention is an attention mechanism where the dot products are scaled down by $\sqrt{d_k}$. Formally we have a query $Q$, a key $K$ and a value $V$ and calculate the attention as:
$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$
If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} u_iv_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.
Source: Attention Is All You NeedPaper | Code | Results | Date | Stars |
---|
Task | Papers | Share |
---|---|---|
Language Modelling | 52 | 6.93% |
Retrieval | 38 | 5.07% |
Semantic Segmentation | 27 | 3.60% |
Question Answering | 26 | 3.47% |
Large Language Model | 25 | 3.33% |
Sentence | 14 | 1.87% |
Object Detection | 13 | 1.73% |
Image Segmentation | 12 | 1.60% |
Benchmarking | 11 | 1.47% |