

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

September 21, 2023
Authors: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia
cs.AI

Abstract

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computation cost. Training LLMs with long context sizes is typically expensive, requiring extensive training hours and GPU resources; for example, training on a context length of 8192 incurs 16x the computational cost in the self-attention layers compared to a length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model with sparse local attention is both effective and efficient. The proposed shift short attention enables context extension with non-trivial computation savings while achieving performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code during training and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA works well for context extension provided the embedding and normalization layers are also trainable. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
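
The abstract states that shift short attention can be added with only two lines of code during training but gives no implementation details. The following is a minimal PyTorch sketch of the general idea, group-wise local attention in which half of the heads operate on sequence groups shifted by half a group. The function name shift_short_attention, the group_size argument, and the tensor layout are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of group-wise "shifted short" attention, assuming
# (batch, seq_len, num_heads, head_dim) inputs and seq_len % group_size == 0.
# Hypothetical helper for illustration; not the authors' implementation.
import torch
import torch.nn.functional as F


def shift_short_attention(q, k, v, group_size):
    bsz, seq_len, num_heads, head_dim = q.shape
    half = num_heads // 2

    def shift(x, direction):
        # Roll the second half of the heads by half a group along the sequence,
        # so the two halves of heads see differently partitioned groups.
        x = x.clone()
        x[:, :, half:] = x[:, :, half:].roll(direction * (group_size // 2), dims=1)
        return x

    q, k, v = (shift(t, -1) for t in (q, k, v))

    def to_groups(x):
        # Fold local groups into the batch dimension so attention stays within groups.
        return (x.reshape(bsz, seq_len // group_size, group_size, num_heads, head_dim)
                 .permute(0, 1, 3, 2, 4)
                 .reshape(bsz * (seq_len // group_size), num_heads, group_size, head_dim))

    out = F.scaled_dot_product_attention(to_groups(q), to_groups(k), to_groups(v))

    out = (out.reshape(bsz, seq_len // group_size, num_heads, group_size, head_dim)
              .permute(0, 1, 3, 2, 4)
              .reshape(bsz, seq_len, num_heads, head_dim))

    # Undo the shift so outputs line up with the original token order.
    return shift(out, +1)
```

On the parameter-efficient side, the abstract's finding that LoRA works for context extension only when the embedding and normalization layers are also trainable could be expressed, assuming the Hugging Face peft library and LLaMA-style module names, roughly as follows; this is a configuration sketch, not the authors' training recipe.

```python
# Hypothetical configuration sketch, assuming Hugging Face peft and LLaMA-style
# module names; rank and alpha values are placeholders.
from peft import LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # low-rank adapters on attention projections
    modules_to_save=["embed_tokens", "norm"],                 # keep embedding and normalization fully trainable
)
```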