
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

July 15, 2024
Authors: Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
cs.AI

Abstract

Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format, with the potential to relax resource requirements. Unlike prior works, which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and the emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for each matrix within LLMs. Going beyond being only a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better fine-tuning capabilities and can closely mimic (sometimes outperform) the training loss trajectory and performance of full fine-tuning, with a notable reduction in memory and compute footprint. For example, fine-tuning a 50% compressed LLaMa-2 7B model using only a fraction of the parameters in LRCs (WeLore) can outperform its full fine-tuning with ~3x better throughput and ~0.6x the GPU requirement. Our code is available at https://github.com/VITA-Group/welore
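A minimal PyTorch sketch of the idea the abstract describes: inspect each weight matrix's singular-value spectrum to judge how low-rank it is, split matrices into LRCs and N-LRCs accordingly, and keep LRCs as truncated-SVD factors. The energy threshold, the ratio cutoff, and all function names below are illustrative assumptions, not the paper's actual criterion or API.

```python
import torch

def effective_rank_ratio(weight: torch.Tensor, energy_threshold: float = 0.9) -> float:
    """Fraction of singular values needed to retain `energy_threshold` of the
    spectral energy. A small ratio indicates a heavy-tailed spectrum, i.e. the
    matrix is well expressed in low-rank form. (Illustrative criterion.)"""
    s = torch.linalg.svdvals(weight.float())          # singular values, descending
    energy = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    k = int((energy < energy_threshold).sum().item()) + 1
    return k / s.numel()

def split_lrc_nlrc(named_weights, ratio_cutoff: float = 0.5):
    """Illustrative split of 2-D weight matrices into low-rank components (LRCs)
    and non-low-rank components (N-LRCs); `ratio_cutoff` is a hypothetical
    hyperparameter, not a value from the paper."""
    lrcs, nlrcs = {}, {}
    for name, w in named_weights:
        if w.dim() != 2:
            continue
        ratio = effective_rank_ratio(w)
        (lrcs if ratio <= ratio_cutoff else nlrcs)[name] = ratio
    return lrcs, nlrcs

def compress_lrc(weight: torch.Tensor, rank: int):
    """Truncated-SVD factorization W ~= A @ B, kept as two small factors that
    replace the dense matrix for compression and fine-tuning."""
    U, s, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (out_features, rank)
    B = Vh[:rank, :]             # (rank, in_features)
    return A, B
```

In the abstract's framing, fine-tuning would then update only the low-rank factors of the LRCs (the A and B above), while the rank chosen per matrix varies non-uniformly across layers; that per-matrix, spectrum-driven choice is where the stated memory and throughput savings come from.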
