

Frac-Connections: Fractional Extension of Hyper-Connections

March 18, 2025
Authors: Defa Zhu, Hongzhi Huang, Jundong Zhou, Zihao Huang, Yutao Zeng, Banggu Wu, Qiyang Min, Xun Zhou
cs.AI

Abstract

Residual connections are central to modern deep learning architectures, enabling the training of very deep networks by mitigating gradient vanishing. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths at different depths, thereby addressing the seesaw effect between gradient vanishing and representation collapse. However, Hyper-Connections increase memory access costs by expanding the width of hidden states. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width. Frac-Connections retain partial benefits of Hyper-Connections while reducing memory consumption. To validate their effectiveness, we conduct large-scale experiments on language tasks, with the largest being a 7B MoE model trained on up to 3T tokens, demonstrating that Frac-Connections significantly outperform residual connections.
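The core contrast described in the abstract can be sketched as follows. This is a conceptual simplification based only on the abstract, not the authors' implementation: the function name `frac_connection_step`, the mixing weights `alpha`, and the residual strengths `beta` are hypothetical names. The hidden state of width `d` is split into `n` fractions; learnable weights mix the fractions into the layer input and distribute the layer output back as residual updates, so the residual stream keeps width `d` (unlike Hyper-Connections, which widen it to `n*d`).

```python
import numpy as np

def frac_connection_step(h, layer, n=2, alpha=None, beta=None):
    """Conceptual sketch of one Frac-Connection residual step.

    Hypothetical simplification inferred from the abstract, not the
    paper's exact formulation. The hidden state h (width d) is split
    into n fractions of width d/n; alpha mixes the fractions into the
    layer input, and beta scales the layer output written back into
    each fraction, generalizing the plain residual update h + layer(h)
    without expanding the hidden-state width.
    """
    d = h.shape[-1]
    assert d % n == 0, "hidden width must divide evenly into n fractions"
    fracs = np.split(h, n, axis=-1)            # n fractions, each d/n wide
    if alpha is None:                          # input-mixing weights
        alpha = np.full(n, 1.0 / n)
    if beta is None:                           # per-fraction residual strengths
        beta = np.ones(n)
    layer_in = sum(a * f for a, f in zip(alpha, fracs))
    out = layer(layer_in)                      # the wrapped layer (e.g. FFN/attn)
    new_fracs = [f + b * out for f, b in zip(fracs, beta)]
    return np.concatenate(new_fracs, axis=-1)  # total width stays d

# Usage: with n=1, alpha=[1], beta=[1] this reduces to a standard
# residual connection h + layer(h).
h = np.arange(8.0)
updated = frac_connection_step(h, np.tanh, n=2)
```

Setting `n = 1` recovers the ordinary residual connection, which matches the abstract's framing of Frac-Connections as a fractional generalization of the residual stream rather than a widened one.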
