Memory-Efficient LLM Training with Online Subspace Descent
August 23, 2024
作者: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
cs.AI
Abstract
Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including the most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers that requires no SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.