BranchNorm: Robustly Scaling Extremely Deep Transformers
May 4, 2023
Authors: Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
cs.AI
Abstract
Recently, DeepNorm scaled Transformers to extreme depths (i.e., 1,000 layers) and revealed the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm (Wang et al., 2022) attempts to constrain the model update to a constant value. Although applying such a constraint can benefit the early stage of model training, it may lead to undertrained models over the whole training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer in accordance with the training period. BranchNorm not only theoretically stabilizes training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
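The core idea described above can be sketched as a residual connection whose non-residual branch is multiplied by a scale factor that grows with training progress. The snippet below is a minimal illustration, not the paper's exact formulation: the class name `BranchNormResidual`, the linear warmup schedule, the `warmup_steps` parameter, and the Post-LN placement of `LayerNorm` are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class BranchNormResidual(nn.Module):
    """Residual connection that rescales the non-residual branch by a
    factor tied to training progress (a sketch of the BranchNorm idea;
    the paper's exact schedule and normalization may differ)."""

    def __init__(self, d_model: int, warmup_steps: int = 4000):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.warmup_steps = warmup_steps
        # Track the training step as a buffer so it moves with the module.
        self.register_buffer("step", torch.zeros(1, dtype=torch.long))

    def branch_scale(self) -> float:
        # Assumed schedule: ramp linearly from 0 to 1 over the warmup
        # period, then fall back to the standard residual (scale = 1),
        # so the constraint only dampens updates early in training.
        return min(self.step.item() / self.warmup_steps, 1.0)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        if self.training:
            self.step += 1
        # Downscale only the non-residual branch; the identity path
        # stays untouched, which is what stabilizes early training.
        return self.norm(x + self.branch_scale() * sublayer(x))
```

In a deep Transformer, `sublayer` would be the attention or feed-forward block; once the scale reaches 1 the layer behaves like a standard Post-LN residual, so later training is not constrained.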