The article also explains how the Transformer model is trained. During training, the model's output is compared against the correct output from a labeled training dataset, and the model's weights are adjusted via backpropagation to push the output closer to the desired one. The metric being optimized is the loss function, which compares the model's predicted probability distribution over the output vocabulary with the target distribution (a one-hot vector marking the correct word), typically measured with cross-entropy or Kullback-Leibler divergence.
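As a concrete illustration of that comparison, here is a minimal NumPy sketch of cross-entropy between a one-hot target distribution and a model's softmax output. The five-token vocabulary and the particular probability values are made up for the example.

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one.
    For a one-hot target it reaches zero only when the prediction is exact."""
    predicted = np.clip(predicted, eps, 1.0)  # avoid log(0)
    return -np.sum(target * np.log(predicted))

# Toy example: vocabulary of 5 tokens, the correct token sits at index 2.
target = np.array([0.0, 0.0, 1.0, 0.0, 0.0])      # one-hot "ground truth"
predicted = np.array([0.1, 0.1, 0.6, 0.1, 0.1])   # model's softmax output

print(cross_entropy(target, predicted))  # ~0.51; training nudges this toward 0
```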
Key takeaways:
- The Transformer model uses attention to speed up training, and it outperforms the Google Neural Machine Translation model on specific tasks.
- The model uses a concept called "self-attention," which lets it look at other positions in the input sequence for clues that lead to a better encoding of each word (a sketch of self-attention and its multi-headed variant follows this list).
- The Transformer model also uses a mechanism called "multi-headed" attention which expands the model’s ability to focus on different positions and gives the attention layer multiple "representation subspaces".
- The model also adds a positional encoding to each input embedding to account for the order of the words in the input sequence (see the sketch below).
- Each sub-layer in each encoder has a residual connection around it, followed by a layer-normalization step.
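To make the attention bullets more concrete, here is a minimal NumPy sketch of scaled dot-product self-attention and a multi-headed version of it. The toy matrix sizes, the random projection weights, and the omission of the final output projection are simplifications for illustration; the paper itself uses a model dimension of 512 with 8 heads of size 64 each.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each output row is a softmax-weighted
    mix of the value vectors, scored by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how much each position attends to every other
    return softmax(scores, axis=-1) @ V   # rows of weights sum to 1

def multi_head_self_attention(X, heads):
    """heads is a list of (Wq, Wk, Wv) projection triples, one per head.
    Each head attends in its own representation subspace; per-head outputs
    are concatenated (the final output projection is omitted here)."""
    outputs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)

# Toy sizes for the demo; not the dimensions used in the paper.
rng = np.random.default_rng(0)
seq_len, d_model, d_k, n_heads = 4, 8, 4, 2
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
print(multi_head_self_attention(X, heads).shape)  # (4, 8): concatenated head outputs
```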
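The positional-encoding and residual/layer-norm bullets can likewise be sketched in a few lines. The sinusoidal formula follows the Transformer paper; the sequence length and model dimension below are toy values, the layer norm drops the learned scale and bias for brevity, and the sub-layer in the "Add & Norm" demo is a placeholder rather than a real attention or feed-forward layer.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dimensions use sine, odd use cosine,
    with geometrically increasing wavelengths. The result is added to the embeddings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-6):
    """Layer normalization without the learned gain/bias, for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Toy "embeddings" with positional information added, then one Add & Norm step
# around a placeholder sub-layer (identity here, standing in for attention or FFN).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) + positional_encoding(4, 8)
sublayer_out = x                  # placeholder for self_attention(x) or feed_forward(x)
x = layer_norm(x + sublayer_out)  # residual connection, then layer normalization
print(x.shape)  # (4, 8)
```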