Small-scale proxies for large-scale Transformer training instabilities
September 25, 2023
Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
cs.AI
Abstract
Teams that have trained large Transformer-based models have reported training
instabilities at large scale that did not appear when training with the same
hyperparameters at smaller scales. Although the causes of such instabilities
are of scientific interest, the amount of resources required to reproduce them
has made investigation difficult. In this work, we seek ways to reproduce and
study training stability and instability at smaller scales. First, we focus on
two sources of training instability described in previous work: the growth of
logits in attention layers (Dehghani et al., 2023) and divergence of the output
logits from the log probabilities (Chowdhery et al., 2022). By measuring the
relationship between learning rate and loss across scales, we show that these
instabilities also appear in small models when training at high learning rates,
and that mitigations previously employed at large scales are equally effective
in this regime. This prompts us to investigate the extent to which other known
optimizer and model interventions influence the sensitivity of the final loss
to changes in the learning rate. To this end, we study methods such as warm-up,
weight decay, and µParam (Yang et al., 2022), and combine techniques to
train small models that achieve similar losses across orders of magnitude of
learning rate variation. Finally, to conclude our exploration, we study two
cases where instabilities can be predicted before they emerge by examining the
scaling behavior of model activation and gradient norms.
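As a concrete illustration (not code from the paper itself), the sketch below shows where the two mitigations associated with the cited instabilities enter a Transformer's computation: layer normalization of queries and keys to bound attention logits (the approach of Dehghani et al., 2023) and an auxiliary z-loss that penalizes the log partition function so output logits stay close to log probabilities (the approach of Chowdhery et al., 2022). The tensor shapes, helper names, and z-loss coefficient are illustrative assumptions.

    # Minimal JAX sketch of QK-layernorm and z-loss; shapes and coefficient are
    # assumptions for illustration, not the authors' implementation.
    import jax
    import jax.numpy as jnp
    from jax.scipy.special import logsumexp

    def layernorm(x, eps=1e-6):
        # Normalize over the feature axis (learned scale/bias omitted for brevity).
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / jnp.sqrt(var + eps)

    def attention(q, k, v, use_qk_layernorm=True):
        # q, k, v: [batch, heads, seq, head_dim]
        if use_qk_layernorm:
            # Normalizing queries and keys bounds the attention logits,
            # countering the logit growth described in the abstract.
            q, k = layernorm(q), layernorm(k)
        logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
        weights = jax.nn.softmax(logits, axis=-1)
        return jnp.einsum("bhqk,bhkd->bhqd", weights, v)

    def loss_with_z_loss(logits, labels, z_coeff=1e-4):
        # logits: [batch, vocab]; labels: [batch] integer token ids.
        log_z = logsumexp(logits, axis=-1)           # log partition function
        log_probs = logits - log_z[..., None]
        ce = -jnp.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)
        # The z-loss term penalizes |log Z|, discouraging the output logits
        # from drifting away from normalized log probabilities.
        return jnp.mean(ce + z_coeff * log_z**2)

The sketch only indicates where each term acts; the paper's experiments measure how such interventions change the sensitivity of the final loss to learning-rate sweeps spanning several orders of magnitude.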