Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

Abstract

Despite the dominance and effectiveness of scaling, which has produced large networks with hundreds of billions of parameters, the necessity of training overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which uses low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate performance comparable to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
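
The abstract does not describe the mechanics of ReLoRA, so the following is not the authors' implementation. It is only a minimal NumPy sketch of the linear-algebra fact behind the phrase "low-rank updates to train high-rank networks": the sum of several rank-r matrices can have rank up to the sum of their ranks. The function name, matrix sizes, and the use of random factors in place of learned ones are all assumptions made for illustration.

```python
import numpy as np


def accumulate_low_rank_updates(dim, rank, num_updates, seed=0):
    """Sum several random rank-`rank` updates and track the rank of the total.

    Hypothetical helper for illustration only: random factors stand in for
    learned low-rank factors, and no training takes place.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros((dim, dim))
    ranks = []
    for _ in range(num_updates):
        # Each update is a product of a (dim x rank) and a (rank x dim) factor,
        # so on its own it has rank at most `rank`.
        b = rng.standard_normal((dim, rank))
        a = rng.standard_normal((rank, dim))
        total += b @ a
        # Rank of the accumulated update after adding this one.
        ranks.append(int(np.linalg.matrix_rank(total)))
    return total, ranks


total, ranks = accumulate_low_rank_updates(dim=32, rank=4, num_updates=4)
print(ranks)
```

With independent random factors, the accumulated rank is expected to grow by `rank` with each added update until it reaches `dim`, which is why a sequence of low-rank updates is not confined to a low-rank total change.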
