VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

Abstract

Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models remains highly compute- and memory-intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so, we find that the intermediate activations used to implement backpropagation can be compressed aggressively without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens into smaller sub-tokens and projects them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We show that the algorithm is complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
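The forward-pass compression and backward-pass reconstruction described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single linear layer, the helper names compress and reconstruct, the toy shapes, and the choice of the projection direction v (the normalised mean of the batch's sub-tokens) are all assumptions made for illustration.

```python
import numpy as np

def compress(x, v, sub_dim):
    """Split each token into sub-tokens of size sub_dim and keep
    one scalar per sub-token: its projection onto the fixed vector v."""
    n_tokens, dim = x.shape
    sub_tokens = x.reshape(n_tokens, dim // sub_dim, sub_dim)
    return sub_tokens @ v  # shape: (n_tokens, dim // sub_dim)

def reconstruct(coeffs, v):
    """Coarsely rebuild the tokens from the stored scalars (rank-1 per sub-token)."""
    n_tokens, n_sub = coeffs.shape
    return (coeffs[..., None] * v).reshape(n_tokens, n_sub * v.shape[0])

rng = np.random.default_rng(0)
n_tokens, dim, out_dim, sub_dim = 8, 16, 4, 4  # hypothetical toy sizes

x = rng.normal(size=(n_tokens, dim)) + 1.0  # input activations
w = rng.normal(size=(dim, out_dim))         # weights of a linear layer

# Assumption: a fixed unit direction taken from the mean sub-token of this batch.
v = x.reshape(-1, sub_dim).mean(axis=0)
v /= np.linalg.norm(v)

# Forward pass: compute the output, but store only the compressed activations.
y = x @ w
coeffs = compress(x, v, sub_dim)  # sub_dim times fewer values than x

# Backward pass: reconstruct the activations and form the weight gradient.
grad_y = rng.normal(size=y.shape)  # stand-in for the upstream gradient
x_hat = reconstruct(coeffs, v)
grad_w_approx = x_hat.T @ grad_y
grad_w_exact = x.T @ grad_y        # reference that needs the full x in memory
```

In this sketch the stored tensor coeffs holds one value per sub-token, so the activation memory for the layer shrinks by a factor of sub_dim, while grad_w_approx is only an approximation of grad_w_exact.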
