LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Abstract

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training at a context length of 8192 requires 16x the self-attention computation of training at 2048. In this paper, we speed up the context extension of LLMs in two respects. On the one hand, although dense global attention is needed during inference, fine-tuning can be done effectively and efficiently with sparse local attention. The proposed shift short attention enables context extension with non-trivial computation savings while performing similarly to fine-tuning with vanilla attention. Notably, it can be implemented with only two lines of code in training and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. We find that LoRA works well for context extension provided that the embedding and normalization layers are trainable. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
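
The 16x figure and the idea of local attention can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' implementation. It assumes that self-attention cost is proportional to the number of query-key pairs, that local attention partitions tokens into fixed-size groups, and that a shifted variant offsets the partition cyclically by half a group. The abstract does not spell out the mechanism, so the group size, the half-group shift, and all function names here are hypothetical.

```python
from typing import List


def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Ratio of query-key pairs scored at two context lengths.

    Dense self-attention scores every token against every other token,
    so the pair count grows with the square of the context length.
    """
    return (long_len / short_len) ** 2


def local_attention_pairs(seq_len: int, group_size: int) -> int:
    """Query-key pairs scored when tokens attend only within their group.

    The group size is an assumed hyperparameter for this illustration.
    """
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be divisible by group_size")
    num_groups = seq_len // group_size
    return num_groups * group_size * group_size


def shifted_groups(seq_len: int, group_size: int, shift: int) -> List[List[int]]:
    """Partition token indices into groups after a cyclic shift.

    With shift=0 this is the plain partition. A non-zero shift makes each
    group straddle a boundary of the plain partition, so information can
    flow between neighbouring groups.
    """
    rolled = [(i + shift) % seq_len for i in range(seq_len)]
    return [rolled[s:s + group_size] for s in range(0, seq_len, group_size)]


if __name__ == "__main__":
    print(attention_cost_ratio(8192, 2048))      # dense 8192 vs dense 2048
    print(8192 * 8192 // local_attention_pairs(8192, 2048))  # dense vs grouped
    print(shifted_groups(8, 4, 0))               # plain groups
    print(shifted_groups(8, 4, 2))               # groups shifted by half a group
```

Under these assumptions, grouping an 8192-token sequence into groups of 2048 scores a quarter of the pairs that dense attention would, and the shifted partition places tokens from adjacent plain groups in the same group.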
