Resonance RoPE: Improving Context Length Generalization of Large Language Models

Abstract

This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences struggle with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach that narrows the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving model performance without additional online computational cost. Furthermore, we present PosGen, a new synthetic benchmark designed specifically for fine-grained behavior analysis in TSTL scenarios. It aims to isolate the steadily increasing difficulty of token generation in long contexts from the challenge of recognizing new token positions. Our experiments on synthetic tasks show that, after applying Resonance RoPE, Transformers recognize OOD positions better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to YaRN, the current state-of-the-art RoPE scaling method, on both upstream language modeling tasks and a variety of downstream long-text applications.
