Memory-Efficient LLM Training with Online Subspace Descent
August 23, 2024
作者: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
cs.AI
Abstract
Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including the most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers that requires no SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.