Memory-Efficient LLM Training with Online Subspace Descent

August 23, 2024
Authors: Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu
cs.AI

Abstract

Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee applies broadly to optimizers that can be analyzed with Hamiltonian Descent, which includes most common ones such as LION and Adam. Inspired by this theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers that requires no SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates it with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that, for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings, narrowing the gap with full-rank baselines.
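
To make the mechanism concrete, below is a minimal, self-contained PyTorch sketch of a subspace-descent step in the spirit of the method described above. It is an illustration under stated assumptions, not the authors' implementation: the class name `OnlineSubspaceAdam`, the `rank` and `pca_lr` hyperparameters, and the particular Oja-style online-PCA rule used to refresh the projection matrix are hypothetical choices made for exposition.

```python
import torch


class OnlineSubspaceAdam:
    """Illustrative sketch: Adam-style moments kept in a rank-r subspace.

    The projection matrix P is refreshed with a single online-PCA step per
    iteration instead of a periodic SVD of the gradient. Assumes float32
    tensors and a 2D weight matrix for simplicity.
    """

    def __init__(self, param, rank=64, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, pca_lr=1e-2):
        m, n = param.shape
        self.param, self.lr, self.eps, self.pca_lr = param, lr, eps, pca_lr
        self.beta1, self.beta2 = betas
        # Projection matrix P (m x r) and low-rank moment buffers (r x n);
        # full-rank Adam would instead store two m x n moment buffers.
        self.P = torch.linalg.qr(torch.randn(m, rank, device=param.device)).Q
        self.m_state = torch.zeros(rank, n, device=param.device)
        self.v_state = torch.zeros(rank, n, device=param.device)
        self.t = 0

    @torch.no_grad()
    def step(self, grad):
        # 1) Online-PCA (Oja-style) refresh of P: take one step toward the
        #    dominant column space of the current gradient, then
        #    re-orthonormalize with a thin QR -- no SVD anywhere.
        self.P += self.pca_lr * (grad @ (grad.T @ self.P))
        self.P = torch.linalg.qr(self.P).Q

        # 2) Project the gradient into the subspace and run Adam there.
        g = self.P.T @ grad                                   # (r, n)
        self.t += 1
        self.m_state.mul_(self.beta1).add_(g, alpha=1 - self.beta1)
        self.v_state.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)
        m_hat = self.m_state / (1 - self.beta1 ** self.t)
        v_hat = self.v_state / (1 - self.beta2 ** self.t)

        # 3) Map the low-rank update back to full shape and apply it.
        self.param -= self.lr * (self.P @ (m_hat / (v_hat.sqrt() + self.eps)))
```

In this sketch the memory saving comes from the moment buffers: for an m×n weight matrix they occupy r×n entries instead of m×n, and the only extra per-step work is two thin matrix products and a QR factorization of an m×r matrix, avoiding the periodic full-gradient SVD that SVD-based subspace methods rely on.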
