Machine learning-based attention is a mechanism mimicking cognitive attention. It calculates "soft" weights for each word (more precisely, for its embedding) in the context window, either in parallel (as in transformers) or sequentially (as in recurrent neural networks). "Soft" weights can change during each run, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards. Multiple attention heads are used in transformer-based large language models.
Predecessors of the mechanism were used in recurrent neural networks which, however, calculated "soft" weights sequentially and, at each step, considered the current word and the other words within the context window. They were known as multiplicative modules, sigma-pi units,[1] and hyper-networks.[2] They have been used in LSTMs, in multi-sensory data processing (sound, images, video, and text) in perceivers, in the memory of fast weight controllers,[3] and in reasoning tasks in differentiable neural computers and neural Turing machines.[4][5][6][7][8]
Correlating the different parts within a sentence or a picture can help capture its structure and meaning. In the sentence "see that girl run" the attention weights originating from the word "that" are calculated by the Q and K sub-networks of a single "attention head" in the illustration below. As a result, the largest soft weight (or attention) is given to the word "girl".
The sentence is split into three paths (left), which merge at the end as the context vector (right). The word embedding size is 300 and the neuron count is 100 in each sub-network of the attention head.
- The capital letter X denotes a matrix sized 4 × 300, consisting of the embeddings of all four words.
- The small underlined letter x denotes the embedding vector (sized 300) of the word "that".
- The attention head includes three (vertically arranged in the illustration) sub-networks, each having 100 neurons and a weight matrix sized 300 × 100.
- The asterisk within parentheses "(*)" denotes softmax( qKᵀ / √100 ), i.e. the result before multiplication by the matrix V.
- Rescaling by √100 prevents a high variance in qKᵀ that would allow a single word to excessively dominate the softmax, resulting in attention to only one word, as a discrete hard max would do.
Notation: the commonly written row-wise softmax formula above assumes that vectors are rows, which contradicts the standard mathematical convention of column vectors. More correctly, we should take the transpose of the context vector and use the column-wise softmax, resulting in the more correct form contextᵀ = Vᵀ × softmax( K qᵀ / √100 ), where the softmax is applied to each column.
The query vector is compared (via dot product) with the key vector of each word. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[9]
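A minimal NumPy sketch of this single-query calculation, using the sizes from the illustration (4 words, embedding size 300, 100 neurons per sub-network); the random embeddings and weight matrices are placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the illustration: 4 word embeddings of size 300 and
# small random 300 x 100 weight matrices (scaled like a typical initialization).
X  = rng.normal(size=(4, 300))                   # embeddings of "see", "that", "girl", "run"
Qw = rng.normal(size=(300, 100)) / np.sqrt(300)  # query sub-network weights
Kw = rng.normal(size=(300, 100)) / np.sqrt(300)  # key sub-network weights
Vw = rng.normal(size=(300, 100)) / np.sqrt(300)  # value sub-network weights

x = X[1]                                  # embedding of the query word "that"
q = x @ Qw                                # query vector, size 100
K = X @ Kw                                # key matrix, 4 x 100 (one key per word)
V = X @ Vw                                # value matrix, 4 x 100

scores  = q @ K.T / np.sqrt(100)          # one alignment score per word
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: the "soft" attention weights
context = weights @ V                     # context vector for "that", size 100

print(weights.round(3))                   # sums to 1; the largest entry marks the most relevant word
```

With random weights the attention comes out roughly uniform; the pattern described above, with most weight on "girl", appears only after Qw and Kw have been trained.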
The structure of the input data is captured in the Qw and Kw weights, and the Vw weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Q), Key (K), and Value (V)—a loose and possibly misleading analogy with relational database systems.
Note that the context vector for "that" does not rely on context vectors for the other words; therefore the context vectors of all words can be calculated using the whole matrix X, which includes all the word embeddings, instead of a single word's embedding vector x in the formula above, thus parallelizing the calculations. Now, the softmax can be interpreted as a matrix softmax acting on separate rows. This is a huge advantage over recurrent networks which must operate sequentially.
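Continuing the toy example above, the parallel form replaces the single query vector q with the full query matrix X Qw, and the softmax is applied row by row:

```python
def softmax_rows(S):
    """Softmax applied independently to each row of S."""
    e = np.exp(S - S.max(axis=1, keepdims=True))     # subtract the row max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

Q = X @ Qw                                           # queries for all four words at once, 4 x 100
contexts = softmax_rows(Q @ K.T / np.sqrt(100)) @ V  # 4 x 100: one context vector per word

# The row for "that" matches the single-word calculation above.
assert np.allclose(contexts[1], context)
```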
To build a machine that translates English to French, an attention unit is grafted to the basic Encoder-Decoder (diagram below). In the simplest case, the attention unit consists of dot products of the recurrent encoder states and does not need training. In practice, the attention unit consists of 3 trained, fully-connected neural network layers called query, key, and value.
Label | Description |
---|---|
100 | Max. sentence length |
300 | Embedding size (word dimension) |
500 | Length of hidden vector |
9k, 10k | Dictionary size of input & output languages respectively. |
x, Y | 9k and 10k 1-hot dictionary vectors. The mapping from the 1-hot vector x to its embedding is implemented as a lookup table rather than a vector multiplication. Y is the 1-hot maximizer of the linear Decoder layer D; that is, it takes the argmax of D's linear layer output. |
x | 300-long word embedding vector. The vectors are usually pre-calculated from other projects such as GloVe or Word2Vec. |
h | 500-long encoder hidden vector. At each point in time, this vector summarizes all the words preceding it. The final h can be viewed as a "sentence" vector, or a thought vector as Hinton calls it. |
s | 500-long decoder hidden state vector. |
E | 500-neuron RNN encoder. 500 outputs. Input count is 800 (300 from the source embedding + 500 from recurrent connections). The encoder feeds directly into the decoder only to initialize it, but not thereafter; hence, that direct connection is shown very faintly. |
D | 2-layer decoder. The recurrent layer has 500 neurons and the fully-connected linear layer has 10k neurons (the size of the target vocabulary).[10] The linear layer alone has 5 million (500 × 10k) weights – ~10 times more weights than the recurrent layer. |
score | 100-long alignment score |
w | 100-long attention weight vector. These are "soft" weights which change during the forward pass, in contrast to "hard" neuronal weights that change during the learning phase. |
A | Attention module – this can be a dot product of recurrent states, or the query-key-value fully-connected layers. The output is a 100-long vector w. |
H | 500×100. 100 hidden vectors h concatenated into a matrix |
c | 500-long context vector = H * w. c is a linear combination of h vectors weighted by w. |
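As a rough NumPy sketch of the simplest (untrained dot-product) attention unit A described above, using the legend's hidden size of 500; the source length of 8 and the random vectors are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, n_words = 500, 8                        # hidden size and source sentence length (at most 100)

H = rng.normal(size=(d_h, n_words))          # encoder hidden vectors h, one per column
s = rng.normal(size=d_h)                     # current decoder hidden state

score = H.T @ s                              # alignment scores: dot product of s with each column h of H
w = np.exp(score - score.max())              # softmax (max subtracted for numerical stability)
w = w / w.sum()                              # "soft" attention weights, sum to 1
c = H @ w                                    # context vector c = H * w, a weighted combination of the h's

print(w.round(2), c.shape)                   # weights over the 8 source words; c is 500-long
```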
Viewed as a matrix, the attention weights show how the network adjusts its focus according to context.
 | I | love | you |
---|---|---|---|
je | 0.94 | 0.02 | 0.04 |
t' | 0.11 | 0.01 | 0.88 |
aime | 0.03 | 0.95 | 0.02 |
This view of the attention weights addresses the neural network "explainability" problem. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. On the first pass through the decoder, 94% of the attention weight is on the first English word "I", so the network offers the word "je". On the second pass of the decoder, 88% of the attention weight is on the third English word "you", so it offers "t'". On the last pass, 95% of the attention weight is on the second English word "love", so it offers "aime".
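Read as code, each decoder step selects its dominant source word via the largest weight in its row; a tiny sketch with the values copied from the table above:

```python
import numpy as np

english = ["I", "love", "you"]
french  = ["je", "t'", "aime"]

# Attention weights from the table: one row per decoder step (French token),
# one column per source word (English token). Each row sums to 1.
A = np.array([[0.94, 0.02, 0.04],
              [0.11, 0.01, 0.88],
              [0.03, 0.95, 0.02]])

for fr, row in zip(french, A):
    print(f"{fr!r} attends mostly to {english[row.argmax()]!r}")  # je->I, t'->you, aime->love
```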
Many variants of attention implement soft weights, such as fast weight programmers (in which a "slow" network generates the "fast" weights of another network through outer products), Bahdanau-style (additive) attention, Luong-style (multiplicative) attention, and the highly parallelizable self-attention used in transformers.
For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention,[17] channel attention,[18] or combinations.[19][20]
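As an illustration of channel attention, here is a minimal NumPy sketch in the style of the squeeze-and-excitation pattern; the layer sizes, reduction ratio, and random weights are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
C, height, width = 64, 16, 16                 # channels, height, width of a CNN feature map
fmap = rng.normal(size=(C, height, width))

# Squeeze: one descriptor per channel via global average pooling.
z = fmap.mean(axis=(1, 2))                    # size C

# Excitation: a small two-layer bottleneck yields one "soft" weight per channel.
W1 = rng.normal(size=(C, C // 8))             # reduction layer (ratio 8 chosen arbitrarily)
W2 = rng.normal(size=(C // 8, C))             # expansion layer
a = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0) @ W2)))   # ReLU then sigmoid, size C

# Re-weight: each channel of the feature map is scaled by its attention weight.
out = fmap * a[:, None, None]
```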
These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients.
Diagrams of five attention variants (left to right): 1. encoder-decoder dot product, 2. encoder-decoder QKV, 3. encoder-only dot product, 4. encoder-only QKV, 5. Pytorch tutorial.
Label | Description |
---|---|
Variables X, H, S, T | Upper case variables represent the entire sentence, and not just the current word. For example, H is a matrix of the encoder hidden state—one word per column. |
S, T | S, decoder hidden state; T, target word embedding. In the Pytorch Tutorial variant training phase, T alternates between 2 sources depending on the level of teacher forcing used. T could be the embedding of the network's output word; i.e. embedding(argmax(FC output)). Alternatively with teacher forcing, T could be the embedding of the known correct word which can occur with a constant forcing probability, say 1/2. |
X, H | H, encoder hidden state; X, input word embeddings. |
W | Attention coefficients |
Qw, Kw, Vw, FC | Weight matrices for query, key, and value respectively. FC is a fully-connected weight matrix. |
⊕, ⊗ | ⊕, vector concatenation; ⊗, matrix multiplication. |
corr | Column-wise softmax of the matrix of all combinations of dot products. The dot products are x_i · x_j in variant 3, h_i · s_j in variant 1, column_i(Kw H) · column_j(Qw S) in variant 2, and column_i(Kw X) · column_j(Qw X) in variant 4. Variant 5 uses a fully-connected layer to determine the coefficients. If the variant is QKV, the dot products are normalized by √d, where d is the height of the QKV matrices. |
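For concreteness, a NumPy sketch of the corr step for variant 4 (encoder-only QKV), under the column-vector convention used in this table; the sizes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_e, d_k, n = 300, 100, 4                         # embedding size, QKV height d, sentence length

X  = rng.normal(size=(d_e, n))                    # input word embeddings, one column per word
Qw = rng.normal(size=(d_k, d_e)) / np.sqrt(d_e)   # query weights (small random init)
Kw = rng.normal(size=(d_k, d_e)) / np.sqrt(d_e)   # key weights

def softmax_cols(S):
    """Softmax applied independently to each column of S."""
    e = np.exp(S - S.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# corr[i, j] = column_i(Kw X) . column_j(Qw X), normalized by sqrt(d_k).
corr = softmax_cols((Kw @ X).T @ (Qw @ X) / np.sqrt(d_k))   # n x n; every column sums to 1
```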