Sliding Window Attention Adaptation

Abstract

The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively switching a model pretrained with full attention (FA) entirely to SWA at inference time causes severe long-context performance degradation because of the training-inference mismatch. This raises a question: can FA-pretrained LLMs be adapted well to SWA without pretraining them again? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible but non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
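
To make the masking ideas in methods (2) and (3) concrete, the following is a minimal sketch in plain Python. It is not taken from the released code. The function names, the convention that the window includes the current token, and the rule that every `fa_every`-th layer keeps full attention are all assumptions made for illustration; the abstract does not specify which layers SWAA keeps as FA. Counting the allowed query-key pairs also shows why SWA is cheaper: for a fixed window, the count grows linearly with sequence length rather than quadratically.

```python
from typing import List

Mask = List[List[bool]]  # mask[q][k] is True if query q may attend to key k


def full_mask(seq_len: int) -> Mask:
    """Causal full attention: each query sees itself and every earlier token."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def swa_mask(seq_len: int, window: int, num_sink: int = 0) -> Mask:
    """Causal sliding-window mask with optional "sink" tokens.

    Assumed convention: a query sees itself and the `window - 1` tokens
    before it, plus the first `num_sink` tokens of the sequence.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = q - k < window
            is_sink = k < num_sink
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def layer_masks(num_layers: int, seq_len: int, window: int,
                num_sink: int, fa_every: int) -> List[Mask]:
    """Interleave FA and SWA layers (hypothetical rule: layer i uses full
    attention when i % fa_every == 0, otherwise sliding-window attention)."""
    fa = full_mask(seq_len)
    swa = swa_mask(seq_len, window, num_sink)
    return [fa if i % fa_every == 0 else swa for i in range(num_layers)]


def attended_pairs(mask: Mask) -> int:
    """Number of (query, key) pairs the mask allows: a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    print(attended_pairs(full_mask(8)))        # 36 pairs: n * (n + 1) / 2
    print(attended_pairs(swa_mask(8, 3)))      # 21 pairs: bounded by n * window
    print(attended_pairs(swa_mask(8, 3, 1)))   # 26 pairs: the sink token adds 5
```

The sketch covers the attention pattern only. Applying SWA only during prefilling, chain-of-thought, and fine-tuning (methods 1, 4, and 5) concern when the mask is used and how the model is prompted or trained, so they are not represented here.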