TY - JOUR
T1 - Stabilizing Transformer Training by Preventing Attention Entropy Collapse
AU - Zhai, Shuangfei
AU - Likhomanenko, Tatiana
AU - Littwin, Etai
AU - Busbridge, Dan
AU - Ramapuram, Jason
AU - Zhang, Yizhe
AU - Gu, Jiatao
AU - Susskind, Josh
N1 - Publisher Copyright:
© 2023 Proceedings of Machine Learning Research. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution where we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound of the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. We show that σReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as enabling training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization or adaptive optimizers; (b) deep architectures in machine translation and (c) speech recognition to competitive performance without warmup and adaptive optimizers. Code is available at https://github.com/apple/ml-sigma-reparam.
AB - Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution where we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound of the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. We show that σReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as enabling training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization or adaptive optimizers; (b) deep architectures in machine translation and (c) speech recognition to competitive performance without warmup and adaptive optimizers. Code is available at https://github.com/apple/ml-sigma-reparam.
UR - http://www.scopus.com/inward/record.url?scp=85174404290&partnerID=8YFLogxK
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.conferencearticle???
AN - SCOPUS:85174404290
SN - 2640-3498
VL - 202
SP - 40770
EP - 40803
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 40th International Conference on Machine Learning, ICML 2023
Y2 - 23 July 2023 through 29 July 2023
ER -