BranchNorm: Robustly Scaling Extremely Deep Transformers
May 4, 2023
Authors: Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
cs.AI
Abstract
Recently, DeepNorm has scaled Transformers to extreme depths (i.e., 1,000 layers), revealing the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm (Wang et al., 2022) constrains the model update to a constant value. Although this constraint benefits the early stage of training, it can leave the model undertrained over the course of the full training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer according to the training period. BranchNorm not only theoretically stabilizes early training with smooth gradient norms, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
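
The abstract does not specify the exact rescaling schedule, so the following is only a minimal PyTorch-style sketch of the general idea: the non-residual branch output F(x) is multiplied by a step-dependent factor that starts small (keeping the update close to identity for early-stage stability, in the spirit of DeepNorm's constraint) and grows toward 1 as training proceeds. The schedule `branch_scale`, the warmup length, and the feed-forward sublayer shown here are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a BranchNorm-style residual block (illustrative assumptions, not the paper's exact method).
import torch
import torch.nn as nn


def branch_scale(step: int, warmup_steps: int) -> float:
    """Hypothetical schedule: grow the non-residual branch scale linearly
    from ~0 up to 1.0 over the first `warmup_steps` updates."""
    return min(1.0, step / max(1, warmup_steps))


class BranchNormBlock(nn.Module):
    """Post-LayerNorm residual block whose non-residual branch is rescaled
    by a training-step-dependent factor."""

    def __init__(self, d_model: int, warmup_steps: int = 4000):
        super().__init__()
        # Stand-in sublayer (a feed-forward network); in a real Transformer this
        # would be the attention or FFN sublayer.
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.warmup_steps = warmup_steps

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        lam = branch_scale(step, self.warmup_steps)
        # Early in training lam is small, so the block stays close to identity
        # (stability); later lam -> 1 and the branch is no longer constrained.
        return self.norm(x + lam * self.sublayer(x))


if __name__ == "__main__":
    block = BranchNormBlock(d_model=512)
    x = torch.randn(2, 10, 512)
    print(block(x, step=100).shape)   # early step: heavily down-scaled branch
    print(block(x, step=8000).shape)  # late step: full-strength branch
```

Under these assumptions, the down-scaled branch limits the effective model update at the start of training (where DeepNorm applies its constant constraint), while the scale's growth toward 1 removes that restriction later, which is the balance the abstract attributes to BranchNorm.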