Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
July 11, 2023
Authors: Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky
cs.AI
Abstract
Despite the dominance and effectiveness of scaling, resulting in large
networks with hundreds of billions of parameters, the necessity to train
overparametrized models remains poorly understood, and alternative approaches
do not necessarily make it cheaper to train high-performance models. In this
paper, we explore low-rank training techniques as an alternative approach to
training large neural networks. We introduce a novel method called ReLoRA,
which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to
pre-training transformer language models with up to 350M parameters and
demonstrate comparable performance to regular neural network training.
Furthermore, we observe that the efficiency of ReLoRA increases with model
size, making it a promising approach for training multi-billion-parameter
networks efficiently. Our findings shed light on the potential of low-rank
training techniques and their implications for scaling laws.
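The abstract describes the core mechanism only at a high level: each training phase applies a LoRA-style low-rank update to frozen weights, and periodic merges and restarts let the cumulative update reach high rank. The sketch below illustrates that idea under stated assumptions; the names `ReLoRALinear`, `merge_and_reinit`, and the hyperparameters (`rank`, `merge_every`) are illustrative choices, not the authors' implementation or API.

```python
# Minimal, hypothetical sketch of restarted low-rank training:
# each phase trains a rank-r update B @ A on top of a frozen weight W,
# then merges it into W and re-initializes the factors, so the
# cumulative update across phases can exceed rank r.
import math
import torch
import torch.nn as nn


class ReLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Frozen full-rank base weight.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.weight.requires_grad = False

        # Trainable low-rank factors: delta W = B @ A has rank <= `rank`.
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is W + B @ A; only A and B receive gradients.
        return x @ (self.weight + self.lora_B @ self.lora_A).T

    @torch.no_grad()
    def merge_and_reinit(self) -> None:
        # Fold the current low-rank update into the frozen weight, then
        # restart the factors so the next phase learns new directions.
        self.weight += self.lora_B @ self.lora_A
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)


if __name__ == "__main__":
    layer = ReLoRALinear(64, 64, rank=4)
    opt = torch.optim.AdamW(
        [p for p in layer.parameters() if p.requires_grad], lr=1e-3
    )
    merge_every = 100  # hypothetical restart interval

    for step in range(300):
        x = torch.randn(16, 64)
        loss = layer(x).pow(2).mean()  # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (step + 1) % merge_every == 0:
            layer.merge_and_reinit()
            # The method also manages optimizer state and the learning-rate
            # schedule around restarts; recreating the optimizer is a crude
            # stand-in for that here.
            opt = torch.optim.AdamW(
                [p for p in layer.parameters() if p.requires_grad], lr=1e-3
            )
```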