BranchNorm: Robustly Scaling Extremely Deep Transformers
May 4, 2023
Authors: Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
cs.AI
Abstract
Recently, DeepNorm has scaled Transformers to extreme depths (i.e., 1,000 layers), revealing the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm (Wang et al., 2022) constrains the model update to a constant value. Although this constraint benefits the early stage of training, it can leave the model undertrained over the course of the full training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer according to the training period. BranchNorm not only theoretically stabilizes early training with smooth gradient norms, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
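
The abstract does not specify the exact rescaling schedule, so the following is only a minimal PyTorch-style sketch of the general idea: the non-residual branch output F(x) is multiplied by a step-dependent factor that starts small (keeping the update close to identity for early-stage stability, in the spirit of DeepNorm's constraint) and grows toward 1 as training proceeds. The schedule `branch_scale`, the warmup length, and the feed-forward sublayer shown here are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a BranchNorm-style residual block (illustrative assumptions, not the paper's exact method).
import torch
import torch.nn as nn


def branch_scale(step: int, warmup_steps: int) -> float:
    """Hypothetical schedule: grow the non-residual branch scale linearly
    from ~0 up to 1.0 over the first `warmup_steps` updates."""
    return min(1.0, step / max(1, warmup_steps))


class BranchNormBlock(nn.Module):
    """Post-LayerNorm residual block whose non-residual branch is rescaled
    by a training-step-dependent factor."""

    def __init__(self, d_model: int, warmup_steps: int = 4000):
        super().__init__()
        # Stand-in sublayer (a feed-forward network); in a real Transformer this
        # would be the attention or FFN sublayer.
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.warmup_steps = warmup_steps

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        lam = branch_scale(step, self.warmup_steps)
        # Early in training lam is small, so the block stays close to identity
        # (stability); later lam -> 1 and the branch is no longer constrained.
        return self.norm(x + lam * self.sublayer(x))


if __name__ == "__main__":
    block = BranchNormBlock(d_model=512)
    x = torch.randn(2, 10, 512)
    print(block(x, step=100).shape)   # early step: heavily down-scaled branch
    print(block(x, step=8000).shape)  # late step: full-strength branch
```

Under these assumptions, the down-scaled branch limits the effective model update at the start of training (where DeepNorm applies its constant constraint), while the scale's growth toward 1 removes that restriction later, which is the balance the abstract attributes to BranchNorm.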