Resonance RoPE: Improving Context Length Generalization of Large Language Models

Abstract

This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences struggle with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach that narrows the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving model performance without additional online computational cost. Furthermore, we present PosGen, a new synthetic benchmark designed for fine-grained behavior analysis in TSTL scenarios, which aims to isolate the constantly increasing difficulty of token generation on long contexts from the challenge of recognizing new token positions. Our experiments on synthetic tasks show that, after applying Resonance RoPE, Transformers recognize OOD positions better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
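The abstract does not spell out how the interpolation of RoPE features is refined. As an illustrative sketch only, the Python below assumes the refinement amounts to rounding each RoPE wavelength to the nearest integer, so that a feature with integer wavelength w repeats exactly every w positions. The function names, the feature dimension of 64, and the base of 10000 are assumptions chosen for illustration and are not taken from the text above.

```python
import math


def rope_thetas(dim, base=10000.0):
    """Standard RoPE angular frequencies, one per pair of feature dimensions."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def resonance_thetas(dim, base=10000.0):
    """Assumed Resonance-style variant: snap every wavelength to an integer.

    The wavelength of a RoPE feature is 2*pi/theta. Rounding it to the nearest
    integer w and using theta' = 2*pi/w makes the feature periodic in whole
    token positions. This can be done once, offline, before inference.
    """
    snapped = []
    for theta in rope_thetas(dim, base):
        wavelength = max(1, round(2.0 * math.pi / theta))
        snapped.append(2.0 * math.pi / wavelength)
    return snapped


def feature_gap(pos_a, pos_b, theta):
    """Euclidean distance between the (cos, sin) features of two positions."""
    return math.hypot(
        math.cos(pos_a * theta) - math.cos(pos_b * theta),
        math.sin(pos_a * theta) - math.sin(pos_b * theta),
    )


if __name__ == "__main__":
    plain = rope_thetas(64)
    snapped = resonance_thetas(64)
    # Compare an unseen position with the in-range position it wraps onto.
    w = round(2.0 * math.pi / snapped[0])
    print("integer wavelength of the first feature:", w)
    print("plain RoPE gap:    ", feature_gap(1000, 1000 % w, plain[0]))
    print("snapped-theta gap: ", feature_gap(1000, 1000 % w, snapped[0]))
```

Under this assumption, a position far beyond the training length produces, for every feature whose integer wavelength fits inside the training window, a value the model has already seen at an earlier position. With the unmodified frequencies the wavelength is not an integer, so no such exact match exists.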
