Abstract: We propose the Transformer, a new network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It is more parallelizable and requires significantly less time to train.
The Transformer achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On WMT 2014 English-to-French, it establishes a new single-model state-of-the-art BLEU score of 41.8.
We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
Recurrent neural networks (RNNs), in particular LSTMs and gated RNNs, have been firmly established as the state of the art in sequence modeling and transduction problems such as language modeling and machine translation, and numerous efforts continue to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models compute sequentially along the symbol positions of the input and output sequences, generating a hidden state h_t as a function of the previous state h_{t-1} and the input at position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths.
Recent work has improved computational efficiency via factorization tricks and conditional computation, but the fundamental constraint of sequential computation remains.
Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences and have become an integral part of sequence models. In most cases, however, they are used in conjunction with a recurrent network.
The Transformer relies entirely on attention to draw global dependencies between input and output. It reaches a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Reducing sequential computation is also the goal of the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks to compute hidden representations in parallel for all input and output positions.
In ConvS2S and ByteNet, the number of operations required to relate signals from two distant positions grows with the distance between them. The Transformer reduces this to a constant number of operations, at the cost of reduced effective resolution, an effect it counteracts with Multi-Head Attention.
Self-attention relates different positions of a single sequence to compute a representation of that sequence. It has been used successfully in reading comprehension, abstractive summarization, and other tasks.
End-to-end memory networks use recurrent attention, performing well on question answering and language modeling.
The Transformer is the first transduction model relying entirely on self-attention, without sequence-aligned RNNs or convolution. In the following sections we describe the model, motivate self-attention, and discuss its advantages.
Most sequence transduction models have an encoder-decoder structure. The encoder maps the input sequence to continuous representations. The decoder then generates an output sequence, one element at a time, auto-regressively.
The Transformer uses stacked self-attention and fully connected layers for both encoder and decoder.
The encoder is a stack of N=6 identical layers, each with two sub-layers: multi-head self-attention and a position-wise feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)).
The decoder is also a stack of N=6 identical layers. In addition to the encoder's two sub-layers, it inserts a third sub-layer that performs multi-head attention over the encoder output. The decoder's self-attention is masked so that a position cannot attend to subsequent positions; see the sketch below.
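As a rough illustration of this masking, a minimal NumPy sketch (the function name and the additive-mask convention are illustrative assumptions, not code from the paper):

```python
import numpy as np

def causal_mask(n):
    """Additive mask for decoder self-attention: position i may attend
    only to positions <= i. Future positions receive -inf so that their
    softmax weight becomes zero once the mask is added to the raw scores."""
    return np.triu(np.full((n, n), -np.inf), k=1)

print(causal_mask(4))  # zeros on and below the diagonal, -inf above it
```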
An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, with each weight computed from the compatibility of the query with the corresponding key.
We use Scaled Dot-Product Attention. We compute dot products of the query with all keys, divide by sqrt(d_k), and use softmax to get weights on the values. Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V
Additive attention computes the compatibility function with a feed-forward network with a single hidden layer. Dot-product attention is much faster and more space-efficient in practice because it can be implemented with highly optimized matrix multiplication code.
For larger values of d_k, additive attention outperforms dot-product attention without scaling: we suspect that large dot products push the softmax into regions with extremely small gradients, and scaling by 1/sqrt(d_k) counteracts this effect. A sketch of the computation follows.
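A minimal NumPy sketch of scaled dot-product attention, assuming batched matrices with queries and keys of width d_k (the helper name and the optional additive-mask argument are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    Dividing by sqrt(d_k) keeps the logits in a range where the softmax
    still has usable gradients when d_k is large."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)    # (..., n_q, n_k)
    if mask is not None:
        scores = scores + mask                            # e.g. -inf on disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```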
Multi-Head Attention linearly projects the queries, keys, and values h times with different learned projections and performs the attention function in parallel on each projection. This lets the model jointly attend to information from different representation subspaces; with a single attention head, averaging inhibits this.
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O, where head_i = Attention(QW^Q_i, KW^K_i, VW^V_i). In this work we use h = 8 parallel heads, with d_k = d_v = d_model/h = 64.
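Reusing the scaled_dot_product_attention sketch above, a hedged illustration of the multi-head computation (passing the per-head projection matrices as plain lists is an assumption made for clarity, not the paper's implementation):

```python
import numpy as np
# Assumes scaled_dot_product_attention from the sketch above is in scope.

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o,
    where head_i = Attention(Q W_q[i], K W_k[i], V W_v[i])."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])[0]
             for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention with d_model = 512, h = 8, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))                          # 10 positions
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```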
The Transformer uses multi-head attention in encoder-decoder attention, encoder self-attention, and decoder self-attention (with masking).
In addition to the attention sub-layers, each layer contains a position-wise feed-forward network, applied to each position separately and identically: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, with input/output dimensionality d_model = 512 and inner-layer dimensionality d_ff = 2048. The linear transformations are the same across positions but use different parameters from layer to layer; another way of describing this is as two convolutions with kernel size 1.
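A minimal NumPy sketch of this feed-forward network (the function and parameter names are illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every
    position. Typical shapes: x (n, 512), W1 (512, 2048), W2 (2048, 512)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```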
Learned embeddings convert input and output tokens to vectors of dimension d_model. The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation; in the embedding layers, the weights are multiplied by sqrt(d_model).
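As a rough sketch of this weight sharing (the initialization and variable names here are illustrative assumptions):

```python
import numpy as np

d_model, vocab_size = 512, 37000
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))     # shared embedding / output matrix

def embed(token_ids):
    """Embedding lookup, with the weights multiplied by sqrt(d_model)."""
    return E[token_ids] * np.sqrt(d_model)

def output_logits(decoder_states):
    """Pre-softmax linear transformation reuses the same matrix E."""
    return decoder_states @ E.T
```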
Since the model contains no recurrence or convolution, positional encodings are added to the input embeddings to provide information about the order of tokens in the sequence.
We use sine and cosine functions of different frequencies: PE_(pos,2i) = sin(pos/10000^(2i/d_model)), PE_(pos,2i+1) = cos(pos/10000^(2i/d_model)).
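A minimal NumPy sketch that builds this encoding table (assuming an even d_model; the names are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)); assumes even d_model."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # added to the scaled token embeddings
```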
We compare self-attention to recurrent/convolutional layers based on: computational complexity, parallelization, and path length between dependencies.
Self-attention connects all positions with a constant number of sequential operations, whereas a recurrent layer requires O(n) sequential operations. Per layer, self-attention costs O(n^2 * d) while a recurrent layer costs O(n * d^2), so self-attention is faster whenever the sequence length n is smaller than the representation dimensionality d, as is typical for sentence representations.
For long sequences, self-attention could be restricted to a neighborhood of size r, increasing the maximum path length to O(n/r).
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions; doing so requires a stack of O(n/k) layers with contiguous kernels, or O(log_k(n)) layers with dilated convolutions.
As a side benefit, self-attention may yield more interpretable models: individual attention heads clearly learn to perform different tasks, and many appear to exhibit behavior related to the syntactic and semantic structure of sentences.
We trained on the WMT 2014 English-German dataset (4.5M sentence pairs, with a shared byte-pair-encoding vocabulary of about 37000 tokens) and the larger WMT 2014 English-French dataset (36M sentences, with a 32000 word-piece vocabulary).
Base models were trained for 100,000 steps (about 12 hours); big models were trained for 300,000 steps (3.5 days) on 8 NVIDIA P100 GPUs.
We used the Adam optimizer with beta_1 = 0.9, beta_2 = 0.98, and epsilon = 10^{-9}. The learning rate increases linearly for the first warmup_steps = 4000 training steps and decreases thereafter proportionally to the inverse square root of the step number: lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5}). A sketch of this schedule is shown below.
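A minimal Python sketch of this schedule (the function name is illustrative):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps steps, then decay proportional
    to the inverse square root of the step number."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lrate(4000))  # peak of the schedule
```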
We used dropout (P_drop = 0.1) and label smoothing (epsilon_ls = 0.1) for regularization. Label smoothing hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU.
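As a rough illustration of label smoothing with epsilon_ls = 0.1 (a minimal sketch; spreading the smoothing mass uniformly over the non-target tokens is an assumption about the exact convention, and the helper name is illustrative):

```python
import numpy as np

def smoothed_targets(labels, vocab_size, eps=0.1):
    """Replace one-hot targets with a smoothed distribution: probability
    1 - eps on the correct token and eps spread over the remaining tokens."""
    targets = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smoothed_targets(np.array([2, 0]), vocab_size=5))
```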
The big Transformer model outperforms the best previously reported models (including ensembles) on WMT 2014 English-to-German by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. Even our base model surpasses all previously published models and ensembles, at a fraction of their training cost.
On WMT 2014 English-to-French, our big model achieves a BLEU score of 41.0, outperforming all previously published single models at less than 1/4 the training cost of the previous state-of-the-art model.
For the base models, we averaged the last 5 checkpoints; for the big models, the last 20. We used beam search with a beam size of 4 and length penalty alpha = 0.6.
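A minimal sketch of checkpoint averaging, assuming each checkpoint is a plain dict mapping parameter names to NumPy arrays (this data layout is an assumption for illustration):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of the parameters from the last k checkpoints
    (k = 5 for the base model, 20 for the big model)."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}
```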
Attention visualizations from our trained models support the interpretability claim above: many heads exhibit attention distributions related to the syntactic and semantic structure of the sentences.
Varying the number of attention heads and the key/value dimensions (keeping the amount of computation constant) affects performance: single-head attention is 0.9 BLEU worse than the best setting, and quality also drops with too many heads.
Reducing the attention key size d_k hurts model quality, suggesting that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.
Bigger models perform better, and dropout is very helpful in avoiding over-fitting. Replacing the sinusoidal positional encoding with learned positional embeddings yields nearly identical results.
The Transformer generalizes well to English constituency parsing. It outperforms previous models in small-data regimes.
Even when training only on the WSJ training set of 40K sentences, the Transformer outperforms the BerkeleyParser.
We presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers, and it achieves new state-of-the-art results.
We plan to apply attention-based models to other tasks and to other input/output modalities, and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs.