Small-scale proxies for large-scale Transformer training instabilities
September 25, 2023
Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
cs.AI
Abstract
Teams that have trained large Transformer-based models have reported training
instabilities at large scale that did not appear when training with the same
hyperparameters at smaller scales. Although the causes of such instabilities
are of scientific interest, the amount of resources required to reproduce them
has made investigation difficult. In this work, we seek ways to reproduce and
study training stability and instability at smaller scales. First, we focus on
two sources of training instability described in previous work: the growth of
logits in attention layers (Dehghani et al., 2023) and divergence of the output
logits from the log probabilities (Chowdhery et al., 2022). By measuring the
relationship between learning rate and loss across scales, we show that these
instabilities also appear in small models when training at high learning rates,
and that mitigations previously employed at large scales are equally effective
in this regime. This prompts us to investigate the extent to which other known
optimizer and model interventions influence the sensitivity of the final loss
to changes in the learning rate. To this end, we study methods such as warm-up,
weight decay, and µParam (Yang et al., 2022), and combine techniques to
train small models that achieve similar losses across orders of magnitude of
learning rate variation. Finally, to conclude our exploration, we study two
cases where instabilities can be predicted before they emerge by examining the
scaling behavior of model activation and gradient norms.
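As a concrete illustration (not code from the paper itself), the sketch below shows where the two mitigations associated with the cited instabilities enter a Transformer's computation: layer normalization of queries and keys to bound attention logits (the approach of Dehghani et al., 2023) and an auxiliary z-loss that penalizes the log partition function so output logits stay close to log probabilities (the approach of Chowdhery et al., 2022). The tensor shapes, helper names, and z-loss coefficient are illustrative assumptions.

    # Minimal JAX sketch of QK-layernorm and z-loss; shapes and coefficient are
    # assumptions for illustration, not the authors' implementation.
    import jax
    import jax.numpy as jnp
    from jax.scipy.special import logsumexp

    def layernorm(x, eps=1e-6):
        # Normalize over the feature axis (learned scale/bias omitted for brevity).
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / jnp.sqrt(var + eps)

    def attention(q, k, v, use_qk_layernorm=True):
        # q, k, v: [batch, heads, seq, head_dim]
        if use_qk_layernorm:
            # Normalizing queries and keys bounds the attention logits,
            # countering the logit growth described in the abstract.
            q, k = layernorm(q), layernorm(k)
        logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
        weights = jax.nn.softmax(logits, axis=-1)
        return jnp.einsum("bhqk,bhkd->bhqd", weights, v)

    def loss_with_z_loss(logits, labels, z_coeff=1e-4):
        # logits: [batch, vocab]; labels: [batch] integer token ids.
        log_z = logsumexp(logits, axis=-1)           # log partition function
        log_probs = logits - log_z[..., None]
        ce = -jnp.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)
        # The z-loss term penalizes |log Z|, discouraging the output logits
        # from drifting away from normalized log probabilities.
        return jnp.mean(ce + z_coeff * log_z**2)

The sketch only indicates where each term acts; the paper's experiments measure how such interventions change the sensitivity of the final loss to learning-rate sweeps spanning several orders of magnitude.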