Small-scale proxies for large-scale Transformer training instabilities

Abstract

Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the resources required to reproduce them have made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and the divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models trained at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and µParam (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, we conclude our exploration by studying two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
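The second instability named in the abstract, output logits drifting away from log probabilities, can be illustrated with a small numerical sketch. Logits equal log probabilities exactly when the log of the softmax normalizer, log Z, is zero; adding a constant to every logit leaves the predicted distribution unchanged but moves log Z away from zero. The abstract does not name the mitigation it evaluates, so the penalty below is an assumption: it follows the auxiliary z-loss of Chowdhery et al. (2022), a coefficient of 1e-4 times the square of log Z. The function names `log_partition`, `log_probs`, and `z_loss` are hypothetical and chosen only for this illustration.

```python
import math


def log_partition(logits):
    """Return log Z = log(sum(exp(logit))), computed stably via max-shift."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def log_probs(logits):
    """Log-softmax: subtract log Z from every logit."""
    log_z = log_partition(logits)
    return [x - log_z for x in logits]


def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2 (assumed form, after Chowdhery et al., 2022)."""
    return coeff * log_partition(logits) ** 2


# Logits that already are log probabilities: log Z is (numerically) zero.
calibrated = [math.log(0.7), math.log(0.2), math.log(0.1)]

# The same distribution with every logit shifted by +5: log Z is now about 5,
# so the logits no longer coincide with the log probabilities.
shifted = [x + 5.0 for x in calibrated]

for name, logits in [("calibrated", calibrated), ("shifted", shifted)]:
    print(name, "log Z =", log_partition(logits), "z-loss =", z_loss(logits))
```

Both logit vectors yield the same log probabilities, so the ordinary cross-entropy loss cannot tell them apart; only the penalty term distinguishes them, which is why it acts on the drift without changing the model's predictions.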
