LongRoPE2: Near-Lossless LLM Context Window Scaling

Abstract

LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to a target length while preserving performance on the original, shorter context window. It rests on three contributions: (1) a hypothesis that insufficient training in the higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that uses evolutionary search guided by "needle-driven" perplexity to address the insufficient-training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt the rescaled RoPE for long-context sequences while preserving short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to a 128K effective context length while retaining over 98.5% of its short-context performance, using only 10B tokens. That is 80x fewer tokens than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
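
To make contributions (2) and (3) concrete, the following is a minimal sketch of per-dimension RoPE rescaling and of choosing between the original and rescaled frequencies by sequence length. It is not the authors' implementation. The function names, the default base of 10000, the placeholder rescale factors, and the simple length threshold are all assumptions made for illustration. In LongRoPE2 the factors come from the evolutionary search, which is not reproduced here.

```python
import math


def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim) for each dimension pair i."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rescale_inv_freq(inv_freq, factors):
    """Divide each inverse frequency by its own factor.

    A factor greater than 1 lengthens the rotation period of that dimension,
    so positions beyond the original window map to smaller rotation angles.
    """
    if len(inv_freq) != len(factors):
        raise ValueError("need exactly one rescale factor per RoPE dimension pair")
    return [f / s for f, s in zip(inv_freq, factors)]


def select_inv_freq(seq_len, original_window, original, rescaled):
    """Assumed mixed-window rule: original RoPE for short inputs, rescaled RoPE for long ones."""
    return original if seq_len <= original_window else rescaled


def rotation_angles(position, inv_freq):
    """Rotation angle (in radians) applied to each dimension pair at a token position."""
    return [position * f for f in inv_freq]


if __name__ == "__main__":
    head_dim = 8
    original = rope_inv_freq(head_dim)
    # Hypothetical placeholder factors; larger values for the higher dimensions.
    factors = [1.0, 2.0, 4.0, 8.0]
    rescaled = rescale_inv_freq(original, factors)

    for seq_len in (4096, 131072):
        chosen = select_inv_freq(seq_len, 8192, original, rescaled)
        last_angles = rotation_angles(seq_len - 1, chosen)
        print(seq_len, [round(math.fmod(a, 2 * math.pi), 4) for a in last_angles])
```

The placeholder factors grow with the dimension index only to echo the abstract's focus on the higher RoPE dimensions. Any real values would have to come from the search procedure the paper describes.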
