From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients

Abstract

Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format, with the potential to relax resource requirements. Unlike prior works, which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and the emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize the performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for matrices within LLMs. Going beyond compression alone, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better fine-tuning capabilities and can closely mimic (sometimes outperform) the training loss trajectory and performance of full fine-tuning with a notable reduction in memory and compute footprint. For example, fine-tuning a 50% compressed LLaMa-2 7B model using only a fraction of the parameters in LRCs (WeLore) can outperform its full fine-tuning with ~3x better throughput and ~0.6x GPU requirement. Our code is available at https://github.com/VITA-Group/welore.
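To make the idea of non-uniform rank reduction concrete, the following is a minimal NumPy sketch, not the WeLore implementation. It assumes a generic spectral-energy criterion: each matrix keeps the smallest rank whose singular values hold a chosen share of the squared spectrum, and a matrix counts as a low-rank component when that rank is a small fraction of its full rank. The function names and the `energy` and `max_ratio` parameters are hypothetical and do not come from the paper or its repository; the actual rank-selection rule used by WeLore is not specified in the abstract.

```python
import numpy as np


def rank_for_energy(weight, energy=0.9):
    """Smallest k whose top-k singular values hold `energy` of the squared spectrum.

    Assumption: an energy threshold stands in for the paper's rank-selection rule.
    """
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))  # guard against round-off when energy is close to 1


def low_rank_factors(weight, k):
    """Return (a, b) with a @ b equal to the best rank-k approximation of weight."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :k] * s[:k]  # shape (m, k), singular values folded into the left factor
    b = vt[:k, :]         # shape (k, n)
    return a, b


def split_components(weights, energy=0.9, max_ratio=0.25):
    """Split named matrices into low-rank and non-low-rank groups (hypothetical rule).

    A matrix is treated as low-rank when the rank needed to reach `energy`
    is at most `max_ratio` of its full rank; it is then stored as two factors.
    """
    low_rank, non_low_rank = {}, {}
    for name, w in weights.items():
        k = rank_for_energy(w, energy)
        if k / min(w.shape) <= max_ratio:
            low_rank[name] = low_rank_factors(w, k)
        else:
            non_low_rank[name] = w
    return low_rank, non_low_rank
```

Because the rank is chosen per matrix, a matrix with a fast-decaying spectrum is compressed aggressively while one with a flat spectrum is left dense, which is the non-uniform behavior the abstract argues for. Storing a rank-k factor pair costs k(m + n) values instead of mn, so the saving only materializes when k is well below mn / (m + n).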
