Abstract: We propose the Transformer, a new network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. It is more parallelizable and requires significantly less time to train.
The Transformer achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On WMT 2014 English-to-French, it establishes a new single-model state-of-the-art BLEU score of 41.8.
We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
Recurrent neural networks (RNNs), in particular LSTMs and gated RNNs, have been firmly established as the state of the art in sequence modeling and transduction problems such as language modeling and machine translation, and numerous efforts continue to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models compute sequentially along the symbol positions of the input and output sequences, generating a hidden state h_t as a function of the previous state h_{t-1} and the input at position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths.
Recent work has improved computational efficiency via factorization tricks and conditional computation, but the fundamental constraint of sequential computation remains.
Attention mechanisms allow modeling of dependencies without regard to their distance in the input or output sequences and have become an integral part of sequence models. In most cases, however, they are used in conjunction with a recurrent network.
The Transformer relies entirely on attention to draw global dependencies between input and output. It reaches a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Reducing sequential computation is also the goal of the Extended Neural GPU, ByteNet, and ConvS2S, all of which use convolutional neural networks to compute hidden representations in parallel for all input and output positions.
In ConvS2S and ByteNet, the number of operations required to relate signals from two distant positions grows with the distance between them. The Transformer reduces this to a constant number of operations, at the cost of reduced effective resolution, an effect it counteracts with Multi-Head Attention.
Self-attention relates different positions of a single sequence to compute a representation of that sequence. It has been used successfully in reading comprehension, abstractive summarization, and other tasks.
End-to-end memory networks use recurrent attention, performing well on question answering and language modeling.
The Transformer is the first transduction model relying entirely on self-attention, without sequence-aligned RNNs or convolution. In the following sections we describe the model, motivate self-attention, and discuss its advantages.
Most sequence transduction models have an encoder-decoder structure. The encoder maps the input sequence to continuous representations. The decoder then generates an output sequence, one element at a time, auto-regressively.
The Transformer uses stacked self-attention and fully connected layers for both encoder and decoder.
The encoder is a stack of N=6 identical layers, each with two sub-layers: multi-head self-attention and a position-wise feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)).
The decoder is also a stack of N=6 identical layers. In addition to the encoder's two sub-layers, it inserts a third sub-layer that performs multi-head attention over the encoder output. The decoder's self-attention is masked so that a position cannot attend to subsequent positions; see the sketch below.
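As a rough illustration of this masking, a minimal NumPy sketch (the function name and the additive-mask convention are illustrative assumptions, not code from the paper):

```python
import numpy as np

def causal_mask(n):
    """Additive mask for decoder self-attention: position i may attend
    only to positions <= i. Future positions receive -inf so that their
    softmax weight becomes zero once the mask is added to the raw scores."""
    return np.triu(np.full((n, n), -np.inf), k=1)

print(causal_mask(4))  # zeros on and below the diagonal, -inf above it
```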
An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, with each weight computed from the compatibility of the query with the corresponding key.
We use Scaled Dot-Product Attention. We compute dot products of the query with all keys, divide by sqrt(d_k), and use softmax to get weights on the values. Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V
Additive attention computes the compatibility function with a feed-forward network with a single hidden layer. Dot-product attention is much faster and more space-efficient in practice because it can be implemented with highly optimized matrix multiplication code.
For larger values of d_k, additive attention outperforms dot-product attention without scaling: we suspect that large dot products push the softmax into regions with extremely small gradients, and scaling by 1/sqrt(d_k) counteracts this effect. A sketch of the computation follows.
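A minimal NumPy sketch of scaled dot-product attention, assuming batched matrices with queries and keys of width d_k (the helper name and the optional additive-mask argument are illustrative, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    Dividing by sqrt(d_k) keeps the logits in a range where the softmax
    still has usable gradients when d_k is large."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)    # (..., n_q, n_k)
    if mask is not None:
        scores = scores + mask                            # e.g. -inf on disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```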
Multi-Head Attention linearly projects the queries, keys, and values h times with different learned projections and performs the attention function in parallel on each projection. This lets the model jointly attend to information from different representation subspaces; with a single attention head, averaging inhibits this.
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O, where head_i = Attention(QW^Q_i, KW^K_i, VW^V_i). In this work we use h = 8 parallel heads, with d_k = d_v = d_model/h = 64.
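Reusing the scaled_dot_product_attention sketch above, a hedged illustration of the multi-head computation (passing the per-head projection matrices as plain lists is an assumption made for clarity, not the paper's implementation):

```python
import numpy as np
# Assumes scaled_dot_product_attention from the sketch above is in scope.

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_o,
    where head_i = Attention(Q W_q[i], K W_k[i], V W_v[i])."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])[0]
             for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention with d_model = 512, h = 8, d_k = d_v = 64.
d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))                          # 10 positions
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```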
The Transformer uses multi-head attention in encoder-decoder attention, encoder self-attention, and decoder self-attention (with masking).
In addition to the attention sub-layers, each layer contains a position-wise feed-forward network, applied to each position separately and identically: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, with input/output dimensionality d_model = 512 and inner-layer dimensionality d_ff = 2048. The linear transformations are the same across positions but use different parameters from layer to layer; another way of describing this is as two convolutions with kernel size 1.
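A minimal NumPy sketch of this feed-forward network (the function and parameter names are illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every
    position. Typical shapes: x (n, 512), W1 (512, 2048), W2 (2048, 512)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```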
Learned embeddings convert input and output tokens to vectors of dimension d_model. The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation; in the embedding layers, the weights are multiplied by sqrt(d_model).
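As a rough sketch of this weight sharing (the initialization and variable names here are illustrative assumptions):

```python
import numpy as np

d_model, vocab_size = 512, 37000
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))     # shared embedding / output matrix

def embed(token_ids):
    """Embedding lookup, with the weights multiplied by sqrt(d_model)."""
    return E[token_ids] * np.sqrt(d_model)

def output_logits(decoder_states):
    """Pre-softmax linear transformation reuses the same matrix E."""
    return decoder_states @ E.T
```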
Since the model contains no recurrence or convolution, positional encodings are added to the input embeddings to provide information about the order of tokens in the sequence.
We use sine and cosine functions of different frequencies: PE_(pos,2i) = sin(pos/10000^(2i/d_model)), PE_(pos,2i+1) = cos(pos/10000^(2i/d_model)).
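A minimal NumPy sketch that builds this encoding table (assuming an even d_model; the names are illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)); assumes even d_model."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # added to the scaled token embeddings
```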
We compare self-attention to recurrent/convolutional layers based on: computational complexity, parallelization, and path length between dependencies.
Self-attention connects all positions with a constant number of sequential operations, whereas a recurrent layer requires O(n) sequential operations. Per layer, self-attention costs O(n^2 * d) while a recurrent layer costs O(n * d^2), so self-attention is faster whenever the sequence length n is smaller than the representation dimensionality d, as is typical for sentence representations.
For long sequences, self-attention could be restricted to a neighborhood of size r, increasing the maximum path length to O(n/r).
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions; doing so requires a stack of O(n/k) layers with contiguous kernels, or O(log_k(n)) layers with dilated convolutions.
As a side benefit, self-attention may yield more interpretable models: individual attention heads clearly learn to perform different tasks, and many appear to exhibit behavior related to the syntactic and semantic structure of sentences.
We trained on the WMT 2014 English-German dataset (4.5M sentence pairs, with a shared byte-pair-encoding vocabulary of about 37000 tokens) and the larger WMT 2014 English-French dataset (36M sentences, with a 32000 word-piece vocabulary).
Base models were trained for 100,000 steps (about 12 hours); big models were trained for 300,000 steps (3.5 days) on 8 NVIDIA P100 GPUs.
We used the Adam optimizer with beta_1 = 0.9, beta_2 = 0.98, and epsilon = 10^{-9}. The learning rate increases linearly for the first warmup_steps = 4000 training steps and decreases thereafter proportionally to the inverse square root of the step number: lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5}). A sketch of this schedule is shown below.
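A minimal Python sketch of this schedule (the function name is illustrative):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup for the first warmup_steps steps, then decay proportional
    to the inverse square root of the step number."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lrate(4000))  # peak of the schedule
```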
We used dropout (P_drop = 0.1) and label smoothing (epsilon_ls = 0.1) for regularization. Label smoothing hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU.
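As a rough illustration of label smoothing with epsilon_ls = 0.1 (a minimal sketch; spreading the smoothing mass uniformly over the non-target tokens is an assumption about the exact convention, and the helper name is illustrative):

```python
import numpy as np

def smoothed_targets(labels, vocab_size, eps=0.1):
    """Replace one-hot targets with a smoothed distribution: probability
    1 - eps on the correct token and eps spread over the remaining tokens."""
    targets = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smoothed_targets(np.array([2, 0]), vocab_size=5))
```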
The big Transformer model outperforms the best previously reported models (including ensembles) on WMT 2014 English-to-German by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. Even our base model surpasses all previously published models and ensembles, at a fraction of their training cost.
On WMT 2014 English-to-French, our big model achieves a BLEU score of 41.0, outperforming all previously published single models at less than 1/4 the training cost of the previous state-of-the-art model.
For the base models, we averaged the last 5 checkpoints; for the big models, the last 20. We used beam search with a beam size of 4 and length penalty alpha = 0.6.
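A minimal sketch of checkpoint averaging, assuming each checkpoint is a plain dict mapping parameter names to NumPy arrays (this data layout is an assumption for illustration):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of the parameters from the last k checkpoints
    (k = 5 for the base model, 20 for the big model)."""
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in checkpoints[0]}
```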
Attention visualizations from our trained models support the interpretability claim above: many heads exhibit attention distributions related to the syntactic and semantic structure of the sentences.
Varying the number of attention heads and the key/value dimensions (keeping the amount of computation constant) affects performance: single-head attention is 0.9 BLEU worse than the best setting, and quality also drops with too many heads.
Reducing the attention key size d_k hurts model quality, suggesting that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.
Bigger models perform better, and dropout is very helpful in avoiding over-fitting. Replacing the sinusoidal positional encoding with learned positional embeddings yields nearly identical results.
The Transformer generalizes well to English constituency parsing. It outperforms previous models in small-data regimes.
Even when training only on the WSJ training set of 40K sentences, the Transformer outperforms the BerkeleyParser.
We presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers, and it achieves new state-of-the-art results.
We plan to apply attention-based models to other tasks and to other input/output modalities, and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs.