VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections
May 28, 2024
Authors: Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng
cs.AI
Abstract
Large language models (LLMs) have recently emerged as powerful tools for
tackling many language-processing tasks. Despite their success, training and
fine-tuning these models is still far too computationally and memory intensive.
In this paper, we identify and characterise the important components needed for
effective model convergence using gradient descent. In doing so we find that
the intermediate activations used to implement backpropagation can be
excessively compressed without incurring any degradation in performance. This
result leads us to a cheap and memory-efficient algorithm for both fine-tuning
and pre-training LLMs. The proposed algorithm simply divides the tokens up into
smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace
during the forward pass. These features are then coarsely reconstructed during
the backward pass to implement the update rules. We confirm the effectiveness
of our algorithm as being complementary to many state-of-the-art PEFT methods
on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for
fine-tuning LLaMA and show competitive performance against other
memory-efficient pre-training methods on the large-scale C4 dataset.
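The sketch below is a minimal PyTorch illustration of the mechanism described in the abstract, applied to a single linear layer: in the forward pass each input token is split into sub-tokens and each sub-token is projected onto a fixed unit vector, so only one scalar per sub-token is stored; in the backward pass the activations are coarsely reconstructed from those scalars to form the weight gradient. The names and choices here (CompressedLinearFn, sub_token_size, the particular projection vector v) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a rank-1 sub-token projection for a linear layer.
# Details (sub_token_size, the fixed projection vector) are assumptions
# made for illustration only.
import torch


class CompressedLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, v, sub_token_size):
        # x: (batch, tokens, d_in); weight: (d_out, d_in); v: (sub_token_size,) unit vector
        out = x @ weight.t()
        b, t, d_in = x.shape
        # Split each token into sub-tokens and keep only one scalar per sub-token:
        # its projection onto the fixed 1-D subspace spanned by v.
        subs = x.reshape(b, t, d_in // sub_token_size, sub_token_size)
        coeffs = subs @ v                      # (b, t, d_in // sub_token_size)
        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = (b, t, d_in)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        b, t, d_in = ctx.x_shape
        # Coarsely reconstruct the activations: each sub-token is approximated
        # by its stored coefficient times the fixed direction v.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(b, t, d_in)
        grad_x = grad_out @ weight             # exact; does not need the saved input
        grad_w = grad_out.reshape(-1, grad_out.shape[-1]).t() @ x_hat.reshape(-1, d_in)
        return grad_x, grad_w, None, None


# Usage: compress 64-dimensional sub-tokens to a single scalar each.
x = torch.randn(2, 8, 256, requires_grad=True)
w = torch.randn(128, 256, requires_grad=True)
v = torch.ones(64) / 64 ** 0.5                # fixed unit-norm 1-D subspace (assumed choice)
y = CompressedLinearFn.apply(x, w, v, 64)
y.sum().backward()
```

In this sketch the activation memory kept for the backward pass shrinks by roughly a factor of sub_token_size, since only one scalar per sub-token is saved; the gradient with respect to the input remains exact, while the weight gradient is computed from the coarse reconstruction.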