
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

September 21, 2023
Authors: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia
cs.AI

Abstract

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 requires 16x the computational cost in self-attention layers compared to a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shifted short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embeddings and normalization layers. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and it is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning; it contains more than 3k long-context question-answer pairs.
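The abstract states that shifted short attention can be implemented with only two lines of code during training. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' released code: queries, keys, and values are split into groups along the sequence, half of the attention heads are shifted by half a group so information can flow between neighboring groups, and standard attention is applied within each group. The function name `shifted_group_attention`, the tensor layout, and the use of PyTorch's `scaled_dot_product_attention` are assumptions for illustration; causal masking is omitted.

```python
import torch
import torch.nn.functional as F

def shifted_group_attention(qkv, group_size):
    """Sketch of group-wise attention with a half-group shift on half of the
    heads (assumed layout). qkv: (B, N, 3, H, D); N divisible by group_size,
    H even. Causal masking is omitted for brevity."""
    B, N, _, H, D = qkv.shape
    G = group_size

    # Shift the second half of the heads by half a group along the sequence axis.
    qkv = torch.cat(
        (qkv[:, :, :, : H // 2], qkv[:, :, :, H // 2:].roll(-G // 2, dims=1)), dim=3
    )

    # Fold groups into the batch dimension so attention stays local to each group.
    qkv = qkv.reshape(B * N // G, G, 3, H, D)
    q, k, v = qkv.unbind(dim=2)                       # each: (B*N/G, G, H, D)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # each: (B*N/G, H, G, D)
    out = F.scaled_dot_product_attention(q, k, v)     # standard attention per group
    out = out.transpose(1, 2).reshape(B, N, H, D)

    # Roll the shifted heads back so token positions line up again.
    out = torch.cat(
        (out[:, :, : H // 2], out[:, :, H // 2:].roll(G // 2, dims=1)), dim=2
    )
    return out
```

Because attention is computed only within each group, the cost grows with the group size rather than the full sequence length, while the half-group shift on half of the heads keeps neighboring groups from becoming isolated.

The second ingredient, LoRA combined with trainable embeddings and normalization layers, can be sketched as a simple parameter-freezing rule. The name-matching patterns below ("lora_", "embed", "norm") are hypothetical and depend on the actual model implementation:

```python
# Sketch: unfreeze only LoRA adapter weights plus embedding and normalization
# parameters; everything else stays frozen, keeping the trainable set small.
for name, param in model.named_parameters():
    param.requires_grad = any(key in name for key in ("lora_", "embed", "norm"))
```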