

VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

May 28, 2024
Authors: Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng
cs.AI

Abstract

Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be excessively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
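To make the mechanism in the abstract concrete, the sketch below re-implements the idea as a custom PyTorch autograd function for a linear layer: during the forward pass each token is split into sub-tokens and only their scalar projections onto a fixed vector are saved; during the backward pass the input is coarsely reconstructed as a rank-1 product and used to form the weight gradient. This is a minimal illustration under stated assumptions, not the authors' code; the name `VeLoRALinearFn`, the sub-token length, and the choice of projection vector are all hypothetical.

```python
import torch

class VeLoRALinearFn(torch.autograd.Function):
    """Hypothetical sketch: a linear layer that saves only rank-1
    sub-token projections of its input for the backward pass."""

    @staticmethod
    def forward(ctx, x, weight, v):
        # x: (batch, tokens, d_in); weight: (d_out, d_in)
        # v: fixed unit-norm vector of length `sub` (the 1-D subspace)
        out = x @ weight.t()
        sub = v.numel()
        # Split each token into sub-tokens of length `sub` and keep only
        # their scalar projections onto v (this is all that gets stored).
        coeffs = x.reshape(*x.shape[:-1], -1, sub) @ v   # (batch, tokens, d_in // sub)
        ctx.save_for_backward(coeffs, weight, v)
        ctx.x_shape = x.shape
        return out

    @staticmethod
    def backward(ctx, grad_out):
        coeffs, weight, v = ctx.saved_tensors
        # Coarse reconstruction: each sub-token is approximated by coeff * v.
        x_hat = (coeffs.unsqueeze(-1) * v).reshape(ctx.x_shape)
        grad_x = grad_out @ weight                        # exact input gradient
        grad_w = grad_out.flatten(0, -2).t() @ x_hat.flatten(0, -2)
        return grad_x, grad_w, None                       # no gradient for the fixed v


# Minimal usage (shapes are illustrative):
x = torch.randn(2, 8, 64, requires_grad=True)
w = torch.randn(32, 64, requires_grad=True)
v = torch.ones(16) / 16 ** 0.5                            # fixed projection direction
y = VeLoRALinearFn.apply(x, w, v)
y.sum().backward()
```

The memory saving comes from storing one scalar per sub-token instead of the full activation; the input gradient is computed exactly, while the weight gradient uses the coarse reconstruction, mirroring the forward/backward split described in the abstract.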
