

Small-scale proxies for large-scale Transformer training instabilities

September 25, 2023
Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith
cs.AI

Abstract

Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the µParam (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration, we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
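The abstract names the attention-logit-growth instability only by citation; in the cited work (Dehghani et al., 2023) the mitigation is qk-layernorm, i.e. applying layer normalization to the queries and keys before the attention dot product so the logits cannot grow with the activation scale. The sketch below is a minimal illustration of that idea, not the paper's implementation; the shapes, random seed, and activation scale are arbitrary choices for the example.

```python
# Minimal sketch of qk-layernorm (the attention-logit-growth mitigation
# referenced via Dehghani et al., 2023). Not the paper's code; shapes and
# the absence of learned scale/bias parameters are simplifying assumptions.
import numpy as np

def layernorm(x, eps=1e-6):
    # Normalize the last axis to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_logits(q, k, head_dim, qk_layernorm=True):
    # q, k: [seq_len, head_dim]; returns [seq_len, seq_len] attention logits.
    if qk_layernorm:
        q, k = layernorm(q), layernorm(k)
    return q @ k.T / np.sqrt(head_dim)

rng = np.random.default_rng(0)
q = 50.0 * rng.standard_normal((8, 64))   # deliberately large activations
k = 50.0 * rng.standard_normal((8, 64))
print(np.abs(attention_logits(q, k, 64, qk_layernorm=False)).max())  # logits blow up
print(np.abs(attention_logits(q, k, 64, qk_layernorm=True)).max())   # logits stay O(1)
```

Running the two print statements shows the effect qualitatively: without qk-layernorm the maximum logit magnitude tracks the (large) activation scale, while with it the logits remain order one regardless of that scale.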
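The second instability, divergence of the output logits from the log probabilities, is the one Chowdhery et al. (2022) address with an auxiliary "z-loss" that penalizes the log of the softmax normalizer so the logits stay close to normalized log-probabilities. A minimal sketch of that auxiliary term follows; the 1e-4 coefficient is the value reported for PaLM and is an assumption here, not something stated in this abstract.

```python
# Minimal sketch of a z-loss auxiliary term (Chowdhery et al., 2022 style),
# added to the usual cross-entropy. Not the paper's code; the coefficient
# z_coeff=1e-4 is an assumed default taken from the PaLM report.
import numpy as np

def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    # logits: [vocab]; target: integer class index.
    m = logits.max()
    log_z = np.log(np.exp(logits - m).sum()) + m   # stable log-sum-exp = log Z
    ce = log_z - logits[target]                    # standard cross-entropy
    z_loss = z_coeff * log_z ** 2                  # penalty keeping log Z near 0
    return ce + z_loss

logits = np.array([2.0, -1.0, 0.5, 8.0])
print(cross_entropy_with_z_loss(logits, target=3))
```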