From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

Abstract

Modern Large Language Models (LLMs) are composed of matrices with billions of elements, which makes storing and processing them demanding in both compute and memory. Because they are so large, such matrices can often be expressed in low-rank form, which has the potential to relax resource requirements. Unlike prior work, which focuses on developing novel matrix decomposition algorithms, this work first studies how low-rank structure emerges across matrices in different layers of LLMs and establishes a consequential relationship between gradient dynamics and the emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, so a non-uniform rank reduction across layers is needed to minimize the performance drop caused by compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for each matrix within an LLM. Going beyond compression, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on how well they can be expressed as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better fine-tuning capabilities and can closely mimic (and sometimes outperform) the training loss trajectory and performance of full fine-tuning, with a notable reduction in memory and compute footprint. For example, fine-tuning a 50% compressed LLaMa-2 7B model using only a fraction of the parameters in LRCs (WeLore) can outperform its full fine-tuning with ~3x better throughput and ~0.6x GPU requirement. Our code is available at https://github.com/VITA-Group/welore
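
To make the idea of choosing a rank from the singular-value spectrum concrete, here is a minimal NumPy sketch. It is not the authors' algorithm and is not taken from the welore repository: the function names, the 90% singular-value-mass threshold, and the 0.5 rank-ratio cutoff used to label a matrix as an LRC are all assumptions chosen for illustration. The sketch picks, per matrix, the smallest rank that retains a fixed share of the singular-value sum, so matrices with heavy-tailed spectra end up with much smaller ranks than matrices with flat spectra.

```python
import numpy as np


def rank_for_mass(weight, keep=0.9):
    """Smallest rank k whose top-k singular values hold `keep` of their total sum."""
    s = np.linalg.svd(weight, compute_uv=False)  # returned in descending order
    cumulative = np.cumsum(s) / np.sum(s)
    k = int(np.searchsorted(cumulative, keep)) + 1
    return min(k, s.size)  # guard against round-off when keep is close to 1


def low_rank_factors(weight, k):
    """Rank-k factors (a, b) with a @ b approximating weight (truncated SVD)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :k] * s[:k], vt[:k, :]


def is_low_rank_component(weight, keep=0.9, max_rank_ratio=0.5):
    """Hypothetical LRC test: the needed rank is a small share of the full rank."""
    return rank_for_mass(weight, keep) / min(weight.shape) <= max_rank_ratio


rng = np.random.default_rng(0)
n = 64
# Synthetic matrix with a heavy-tailed spectrum: singular values 1, 1/4, 1/9, ...
q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
heavy = q1 @ np.diag(1.0 / np.arange(1, n + 1) ** 2) @ q2.T
# Plain Gaussian matrix: its spectrum decays slowly.
dense = rng.standard_normal((n, n))

k = rank_for_mass(heavy)
a, b = low_rank_factors(heavy, k)
print("heavy-tailed:", k, is_low_rank_component(heavy), a.size + b.size, "vs", heavy.size)
print("gaussian:    ", rank_for_mass(dense), is_low_rank_component(dense))
```

In this toy setup the heavy-tailed matrix is stored as two thin factors with far fewer entries than the original, while the Gaussian matrix needs most of its rank to reach the same threshold and would be left uncompressed. Applying the same threshold to every matrix is one simple way to obtain a non-uniform set of ranks across layers.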
