BranchNorm: Robustly Scaling Extremely Deep Transformers

Abstract

DeepNorm (Wang et al., 2022) recently scaled Transformers to extreme depths (e.g., 1000 layers), revealing the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm attempts to constrain the model update to a constant value. Although such a constraint can benefit the early stage of training, it may leave the model undertrained over the whole training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer according to the training period. BranchNorm not only theoretically stabilizes training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
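
The idea of rescaling the non-residual branch according to the training period can be illustrated with a minimal sketch. The abstract does not state the exact formula, so everything here is an assumption made for illustration: the linear ramp in `branch_scale`, the `warmup_steps` parameter, the placement of layer normalization after the residual sum, and all function names are hypothetical and are not taken from the source.

```python
import math
from typing import Callable, List

Vector = List[float]


def branch_scale(step: int, warmup_steps: int) -> float:
    """Hypothetical schedule: grow linearly from 0 to 1, then stay at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(
    x: Vector,
    branch: Callable[[Vector], Vector],
    step: int,
    warmup_steps: int,
) -> Vector:
    """Residual block whose non-residual branch is scaled by the schedule.

    Assumed form: LN(x + alpha * branch(x)), with alpha from branch_scale.
    """
    alpha = branch_scale(step, warmup_steps)
    fx = branch(x)
    return layer_norm([xi + alpha * fi for xi, fi in zip(x, fx)])


if __name__ == "__main__":
    # Toy branch standing in for an attention or feed-forward sublayer.
    toy_branch = lambda v: [u * u for u in v]
    x = [1.0, 2.0, 3.0, 4.0]
    for step in (0, 50, 100, 200):
        alpha = branch_scale(step, warmup_steps=100)
        y = residual_block(x, toy_branch, step, warmup_steps=100)
        print(step, alpha, [round(v, 3) for v in y])
```

Under these assumptions the branch contributes nothing at step 0, so each block reduces to a normalized identity mapping, and it contributes fully once `step` reaches `warmup_steps`. That is one simple way to obtain a small, smooth update early in training without constraining the model for the rest of it.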
