VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

Abstract

Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models remains highly demanding in both computation and memory. In this paper, we identify and characterise the components needed for effective model convergence under gradient descent. In doing so, we find that the intermediate activations used to implement backpropagation can be heavily compressed without incurring any degradation in performance. This result leads to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm divides each token into smaller sub-tokens and projects them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. On the VTAB-1k fine-tuning benchmark, we confirm that our algorithm is complementary to many state-of-the-art parameter-efficient fine-tuning (PEFT) methods. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
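The compress-then-reconstruct idea can be illustrated with a minimal sketch. It is not the reference implementation: the function names, the pure-Python list representation, and the way the projection vector `v` is chosen are all assumptions made for illustration. The sketch splits one token into sub-tokens, stores a single scalar per sub-token (its projection onto a fixed unit vector), and later rebuilds a coarse approximation of the token from those scalars.

```python
import math


def split_sub_tokens(token, sub_dim):
    """Split one token (a flat list of floats) into equal-size sub-tokens."""
    if len(token) % sub_dim != 0:
        raise ValueError("token length must be divisible by sub_dim")
    return [token[i:i + sub_dim] for i in range(0, len(token), sub_dim)]


def unit_vector(vec):
    """Normalise a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def compress(token, v):
    """Forward pass: keep one scalar per sub-token, its projection onto v.

    v is assumed to be a fixed unit vector whose length is the sub-token size.
    """
    sub_tokens = split_sub_tokens(token, len(v))
    return [sum(a * b for a, b in zip(sub, v)) for sub in sub_tokens]


def reconstruct(coeffs, v):
    """Backward pass: rebuild a coarse token as coefficient * v per sub-token."""
    return [c * x for c in coeffs for x in v]


# Example: an 8-dimensional token stored as 4 scalars (sub-token size 2).
token = [1.0, 2.0, 2.0, 4.0, 3.0, 6.0, -1.0, -2.0]
v = unit_vector([1.0, 2.0])          # assumed fixed projection direction
coeffs = compress(token, v)          # what would be kept for the backward pass
approx = reconstruct(coeffs, v)      # coarse activation used for the update
```

In this example every sub-token happens to be parallel to `v`, so the reconstruction is exact. In general, the component of each sub-token orthogonal to `v` is discarded, which is why the reconstruction is only coarse. The stored activation shrinks by a factor equal to the sub-token size.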
