Extending LLMs' Context Window with 100 Samples

Abstract

Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window, which constrains their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior work such as Position Interpolation (PI) and YaRN is resource-intensive and lacks comparative experiments to assess its applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e., the information entropy of attention scores) to remain stable, and we introduce a novel extension to RoPE that combines adjusting RoPE's base frequency with scaling the attention logits to help LLMs adapt efficiently to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we explore how data compositions and training curricula affect context window extension for specific downstream tasks, and we suggest fine-tuning LLMs on lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.
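The two ingredients named in the abstract, RoPE's base frequency and a scale applied to the attention logits, can be illustrated with a toy sketch. This is not the released implementation: the function names, the base values, the scale factor, and the toy logits are all assumptions chosen for illustration. It only shows that attention entropy grows as more keys are attended to, and that multiplying the logits by a factor greater than 1 pulls it back down.

```python
import math


def rope_inverse_frequencies(head_dim, base=10000.0):
    """Per-pair RoPE rotation frequencies, base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(logits, scale=1.0):
    """Shannon entropy (in nats) of softmax(scale * logits)."""
    probs = softmax([scale * x for x in logits])
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical values, not taken from the text above.
head_dim = 8
default_freqs = rope_inverse_frequencies(head_dim, base=10000.0)
larger_base_freqs = rope_inverse_frequencies(head_dim, base=500000.0)

# Toy attention logits for one query over a short and a 4x longer context.
short_logits = [0.1 * ((7 * i) % 11) for i in range(64)]
long_logits = short_logits * 4

entropy_short = attention_entropy(short_logits)
entropy_long = attention_entropy(long_logits)
entropy_long_scaled = attention_entropy(long_logits, scale=1.5)

print(f"entropy, 64 keys:               {entropy_short:.4f}")
print(f"entropy, 256 keys:              {entropy_long:.4f}")
print(f"entropy, 256 keys, scale = 1.5: {entropy_long_scaled:.4f}")
```

A larger base lowers every rotation frequency except the first, so positions rotate more slowly. How the method actually chooses the base and the scale is not stated in the abstract.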
