LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Abstract

We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computational cost. Training LLMs with long context sizes is typically expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 requires 16x the computational cost in self-attention layers compared with a context length of 2048, because that cost grows quadratically with sequence length. In this paper, we speed up the context extension of LLMs in two respects. On the one hand, although dense global attention is needed during inference, the model can be fine-tuned effectively and efficiently with sparse local attention. The proposed shift short attention enables context extension with non-trivial computation savings and performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, and it is optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA works well for context extension provided that the embedding and normalization layers are also trainable. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and it is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
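The two quantitative ideas in the abstract, the quadratic cost of self-attention and the sparse local pattern used during fine-tuning, can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not spell out the mechanism, so the sketch assumes that tokens are split into fixed-size groups, that half of the attention heads use groups shifted by half a group size, and that the shift wraps around the sequence. All function names, and the group size used in the example, are hypothetical, and causal masking is omitted.

```python
def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Relative self-attention cost, assuming cost grows with length squared."""
    return (long_len / short_len) ** 2


def group_members(seq_len: int, group_size: int, shift: int = 0) -> dict:
    """Map each token index to the token indices it attends to.

    Tokens are partitioned into groups of `group_size`. With a non-zero
    `shift`, the group boundaries are offset by `shift` positions and the
    last group wraps around to the start (an assumption of this sketch).
    """
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be divisible by group_size")
    groups = {}
    for token in range(seq_len):
        group_id = ((token - shift) % seq_len) // group_size
        groups.setdefault(group_id, []).append(token)
    return {token: members for members in groups.values() for token in members}


def head_patterns(num_heads: int, seq_len: int, group_size: int) -> list:
    """Attention pattern per head: the first half of the heads use unshifted
    groups, the second half use groups shifted by half a group size
    (assumed here), so information can flow between neighbouring groups."""
    half = num_heads // 2
    return [
        group_members(seq_len, group_size, 0 if head < half else group_size // 2)
        for head in range(num_heads)
    ]


def score_pairs(seq_len: int, group_size: int) -> int:
    """Number of query-key pairs scored by one head under grouped attention."""
    return sum(len(keys) for keys in group_members(seq_len, group_size).values())


if __name__ == "__main__":
    print(attention_cost_ratio(8192, 2048))
    patterns = head_patterns(num_heads=4, seq_len=8, group_size=4)
    print(patterns[0][3])  # keys seen by token 3 in an unshifted head
    print(patterns[2][3])  # keys seen by token 3 in a shifted head
    print(score_pairs(8192, 2048), 8192 ** 2)
```

Under these assumptions, with 8192 tokens and a hypothetical group size of 2048, each head scores 8192 x 2048 query-key pairs instead of 8192 x 8192, a 4x reduction, while the shifted heads keep adjacent groups connected.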
