

Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

July 11, 2023
Authors: Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky
cs.AI

Abstract

Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
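The core intuition behind the abstract, that accumulating several low-rank updates can produce a high-rank overall change to the weights, can be illustrated with a minimal numpy sketch. This is only a linear-algebra illustration, not the authors' ReLoRA training procedure: the layer width, update rank, number of merges, and the random stand-in factors below are assumptions made here for demonstration.

```python
# Minimal sketch: the sum of several independent rank-r updates can have rank
# far above r, which is the intuition behind "high-rank training through
# low-rank updates". Factors are random stand-ins, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
d, r, num_merges = 256, 8, 16      # hypothetical layer width, update rank, merge count

W = 0.01 * rng.normal(size=(d, d))  # main weight matrix
total_update = np.zeros((d, d))

for _ in range(num_merges):
    # each cycle produces a fresh rank-r factorization and merges it into W
    B = rng.normal(size=(d, r))
    A = rng.normal(size=(r, d))
    delta = B @ A                   # rank <= r
    W += delta                      # merge the low-rank update into the weights
    total_update += delta

print("rank of a single update:     ", np.linalg.matrix_rank(B @ A))        # <= r
print("rank of accumulated update:  ", np.linalg.matrix_rank(total_update)) # up to r * num_merges
```

Running the sketch shows a single update capped at rank 8 while the accumulated update reaches a rank close to 128, i.e., each step stays cheap while the total change to the network is high-rank.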