

From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

July 15, 2024
Authors: Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
cs.AI

Abstract

Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format, with the potential to relax resource requirements. Unlike prior works which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and the emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for each matrix within an LLM. Going beyond being only a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better fine-tuning capabilities and can closely mimic (and sometimes outperform) the training loss trajectory and performance of full fine-tuning, with a notable reduction in memory and compute footprint. For example, fine-tuning a 50% compressed LLaMa-2 7B model using only a fraction of the parameters in LRCs (WeLore) can outperform its full fine-tuning with ~3x better throughput and ~0.6x the GPU requirement. Our code is available at https://github.com/VITA-Group/welore.
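
To make the idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) of how a per-matrix rank could be chosen from a heavy-tailed singular-value spectrum and how a matrix could then be labeled as an LRC or N-LRC. The function name `welore_style_compress` and the thresholds `energy_keep=0.995` and `lrc_ratio_cutoff=0.5` are illustrative assumptions, not values from the paper.

```python
import torch


def welore_style_compress(weight: torch.Tensor,
                          energy_keep: float = 0.995,
                          lrc_ratio_cutoff: float = 0.5):
    """Illustrative sketch: pick a per-matrix rank from the singular-value
    spectrum and label the matrix LRC or N-LRC.

    `energy_keep` and `lrc_ratio_cutoff` are assumed hyperparameters for
    illustration, not values taken from the paper.
    """
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)

    # Smallest rank r whose leading singular values capture `energy_keep`
    # of the spectral energy; a heavy-tailed spectrum yields a small r.
    energy = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    threshold = torch.tensor(energy_keep, device=energy.device)
    r = min(int(torch.searchsorted(energy, threshold).item()) + 1, S.numel())

    if r / S.numel() <= lrc_ratio_cutoff:
        # Low-rank Component (LRC): replace W by two thin factors A @ B.
        A = U[:, :r] * S[:r]   # shape (m, r)
        B = Vh[:r, :]          # shape (r, n)
        return "LRC", A, B
    # Non-Low-rank Component (N-LRC): keep the dense weight as-is.
    return "N-LRC", weight, None


# Usage: classify one layer's weight matrix.
layer = torch.nn.Linear(4096, 1024, bias=False)
kind, A, B = welore_style_compress(layer.weight.data)
print(kind, A.shape, None if B is None else B.shape)
```

In such a scheme, only the thin LRC factors (A, B) would be left trainable during fine-tuning while N-LRCs stay dense and frozen, which is where the memory and throughput savings described in the abstract would come from.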
