Researchers discover explicit registers eliminate vision transformer attention spikes

Researchers from Meta and INRIA have found an anomaly in the inner workings of Vision Transformers (ViTs): the models often focus more on uninformative background patches than on the main subjects of images. They traced the cause to high-norm outlier tokens, which are patch tokens with abnormally high L2 norms. These outliers appear during the training of large models. Compared with ordinary patch tokens, they retain less information about their original patch yet are more predictive of the full image category.
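
To make the idea of a high-norm outlier concrete, here is a minimal sketch that computes the L2 norm of each patch token and flags any token whose norm sits far above the median. The function name `find_high_norm_tokens`, the median-based cutoff, and the factor of 5 are assumptions chosen for illustration, not the criterion used in the study. Plain Python lists stand in for real token embeddings so the example runs without dependencies.

```python
import math
import statistics


def l2_norm(vec):
    """Euclidean (L2) norm of one token embedding."""
    return math.sqrt(sum(x * x for x in vec))


def find_high_norm_tokens(tokens, factor=5.0):
    """Return indices of tokens whose L2 norm exceeds factor * median norm.

    `factor` is a hypothetical threshold chosen for illustration only.
    """
    norms = [l2_norm(t) for t in tokens]
    cutoff = factor * statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Toy data: four ordinary patch tokens and one outlier at index 2.
patch_tokens = [
    [0.5, -0.2, 0.1],
    [0.3, 0.4, -0.1],
    [12.0, -9.0, 15.0],
    [-0.2, 0.1, 0.6],
    [0.4, 0.4, 0.2],
]
outliers = find_high_norm_tokens(patch_tokens)
print(outliers)  # [2]
```

In a real model the tokens would come from the output of a transformer layer, and the cutoff would need to be chosen by inspecting the actual distribution of norms.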

To address this issue, the researchers proposed adding "register" tokens to the input sequence. These extra tokens give the model dedicated temporary storage for internal computations, so it no longer needs to hijack random patch embeddings for that purpose. This simple tweak resulted in smoother, more semantically meaningful attention maps, minor performance boosts on various benchmarks, and improved object discovery abilities. The study highlights the importance of understanding the inner workings of neural networks to guide incremental improvements.
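
The sequence bookkeeping behind register tokens can be sketched as follows. This is an illustration under stated assumptions rather than the researchers' implementation: the helper names (`add_registers`, `drop_registers`, `toy_encoder`) are hypothetical, the position of the registers in the sequence is an arbitrary choice, the encoder is an identity placeholder, and discarding the register outputs at the end is assumed here because the registers only serve as scratch space.

```python
def add_registers(cls_token, patch_tokens, register_tokens):
    """Build the encoder input: [CLS], then registers, then patch tokens."""
    return [cls_token] + list(register_tokens) + list(patch_tokens)


def drop_registers(output_tokens, num_registers):
    """Keep the [CLS] and patch outputs; discard the register outputs."""
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + num_registers:]
    return cls_out, patch_out


def toy_encoder(tokens):
    """Placeholder for the transformer blocks (identity, for illustration)."""
    return [list(t) for t in tokens]


embed_dim = 3
cls_token = [0.0] * embed_dim
patch_tokens = [[0.1 * i] * embed_dim for i in range(1, 5)]  # 4 patches
# In a real model these would be learnable parameters; zeros stand in here.
register_tokens = [[0.0] * embed_dim for _ in range(2)]      # 2 registers

sequence = add_registers(cls_token, patch_tokens, register_tokens)
cls_out, patch_out = drop_registers(toy_encoder(sequence), len(register_tokens))
print(len(sequence), len(patch_out))  # 7 4
```

The point of the design is that downstream tasks still see only the class token and the patch tokens, while the extra slots absorb the global computation that would otherwise overwrite patch embeddings.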

Key takeaways:

Researchers have discovered that Vision Transformers (ViTs) often focus on unimportant background elements in images, due to a small fraction of patch tokens having abnormally high L2 norms.
The authors of the study hypothesize that these high-norm tokens are used by the model to store temporary global information about the full image, a process they refer to as "recycling".
To prevent this recycling process from hijacking random patch embeddings, the researchers propose adding "register" tokens to the sequence, providing temporary storage for internal computations.
This simple fix not only produces smoother attention maps, but also yields minor performance gains on various benchmarks and greatly improves object discovery abilities.
