Scaled Dot-Product Attention

Introduced by Vaswani et al. in Attention Is All You Need

Scaled dot-product attention is an attention mechanism in which the dot products between queries and keys are divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys. Formally, given query, key, and value matrices $Q$, $K$, and $V$, the attention is computed as:

$$ {\text{Attention}}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $$
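
The formula can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than the paper's implementation: it assumes $Q$, $K$, and $V$ are plain lists of row vectors, omits masking, batching, and multiple heads, and the names `softmax` and `scaled_dot_product_attention` are chosen here for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q is n x d_k, K is m x d_k, V is m x d_v (lists of lists).
    Returns (output, weights): output is n x d_v, weights is n x m.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


# Toy example: each query matches one key more strongly than the other.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of $V$, weighted toward the values whose keys best match the query.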

If we assume that $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean $0$ and variance $1$, then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$. Since we would prefer these values to have variance $1$, we divide by $\sqrt{d_k}$.
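
The variance argument can be checked empirically with a small simulation. The sketch below assumes the components are standard normal, which is one convenient distribution with mean $0$ and variance $1$ (the argument itself does not require normality). The helper name `dot_product_variance`, the trial count, and the seed are illustrative choices.

```python
import math
import random
import statistics


def dot_product_variance(d_k, trials=4000, scaled=False, seed=0):
    """Estimate the variance of q . k for random d_k-dimensional q and k.

    Components are drawn independently from N(0, 1) (an assumption made
    for this sketch). If scaled is True, each dot product is divided by
    sqrt(d_k) before the variance is measured.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        if scaled:
            dot /= math.sqrt(d_k)
        samples.append(dot)
    return statistics.pvariance(samples)


d_k = 64
raw_variance = dot_product_variance(d_k)                  # expected to be close to d_k
scaled_variance = dot_product_variance(d_k, scaled=True)  # expected to be close to 1
print(raw_variance, scaled_variance)
```

With these settings the unscaled estimate should land near $d_k = 64$ and the scaled estimate near $1$, up to sampling noise.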

Source: Attention Is All You Need

Tasks

Task	Papers	Share
Language Modelling	52	6.93%
Retrieval	38	5.07%
Semantic Segmentation	27	3.60%
Question Answering	26	3.47%
Large Language Model	25	3.33%
Sentence	14	1.87%
Object Detection	13	1.73%
Image Segmentation	12	1.60%
Benchmarking	11	1.47%

Components

Component	Type
Softmax	Output Functions

Categories

Attention Mechanisms