VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections
May 28, 2024
Authors: Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng
cs.AI
Abstract
Large language models (LLMs) have recently emerged as powerful tools for
tackling many language-processing tasks. Despite their success, training and
fine-tuning these models is still far too computationally and memory intensive.
In this paper, we identify and characterise the important components needed for
effective model convergence using gradient descent. In doing so we find that
the intermediate activations used to implement backpropagation can be
excessively compressed without incurring any degradation in performance. This
result leads us to a cheap and memory-efficient algorithm for both fine-tuning
and pre-training LLMs. The proposed algorithm simply divides the tokens up into
smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace
during the forward pass. These features are then coarsely reconstructed during
the backward pass to implement the update rules. We confirm the effectiveness
of our algorithm as being complementary to many state-of-the-art PEFT methods
on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for
fine-tuning LLaMA and show competitive performance against other
memory-efficient pre-training methods on the large-scale C4 dataset.
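The sketch below is a minimal PyTorch illustration of the mechanism described in the abstract, applied to a single linear layer: in the forward pass each input token is split into sub-tokens and each sub-token is projected onto a fixed unit vector, so only one scalar per sub-token is stored; in the backward pass the activations are coarsely reconstructed from those scalars to form the weight gradient. The names and choices here (CompressedLinearFn, sub_token_size, the particular projection vector v) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a rank-1 sub-token projection for a linear layer.
# Details (sub_token_size, the fixed projection vector) are assumptions
# made for illustration only.
import torch


class CompressedLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, v, sub_token_size):
        # x: (batch, tokens, d_in); weight: (d_out, d_in); v: (sub_token_size,) unit vector
        out = x @ weight.t()
        b, t, d_in = x.shape
        # Split each token into sub-tokens and keep only one scalar per sub-token:
        # its projection onto the fixed 1-D subspace spanned by v.
        subs = x.reshape(b, t, d_in // sub_token_size, sub_token_size)
        coeffs = subs @ v                      # (b, t, d_in // sub_token_size)
        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = (b, t, d_in)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        b, t, d_in = ctx.x_shape
        # Coarsely reconstruct the activations: each sub-token is approximated
        # by its stored coefficient times the fixed direction v.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(b, t, d_in)
        grad_x = grad_out @ weight             # exact; does not need the saved input
        grad_w = grad_out.reshape(-1, grad_out.shape[-1]).t() @ x_hat.reshape(-1, d_in)
        return grad_x, grad_w, None, None


# Usage: compress 64-dimensional sub-tokens to a single scalar each.
x = torch.randn(2, 8, 256, requires_grad=True)
w = torch.randn(128, 256, requires_grad=True)
v = torch.ones(64) / 64 ** 0.5                # fixed unit-norm 1-D subspace (assumed choice)
y = CompressedLinearFn.apply(x, w, v, 64)
y.sum().backward()
```

In this sketch the activation memory kept for the backward pass shrinks by roughly a factor of sub_token_size, since only one scalar per sub-token is saved; the gradient with respect to the input remains exact, while the weight gradient is computed from the coarse reconstruction.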