Revisiting Long-context Modeling from Context Denoising Perspective

Abstract

Long-context models (LCMs) have shown great potential in processing long sequences and now support many real-world applications. Their success can be attributed to their ability to locate implicit critical information within the context and base predictions on it. However, recent research shows that LCMs are often susceptible to context noise, i.e., irrelevant tokens, which can mislead the model's attention. In this paper, we conduct a fine-grained analysis of context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify noisy information within the context. Our findings show that even simple mitigation of the detected context noise substantially boosts the model's attention on critical tokens and benefits subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a simple yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, an open-source 8B model trained with CDT achieves performance (50.92) comparable to GPT-4o (51.00).
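The abstract does not spell out how the IG score is computed. As a rough illustration only, the sketch below implements the standard integrated-gradients attribution from the attribution literature on a toy differentiable function. The toy function `f`, its weights `w`, the zero baseline, and the near-zero threshold used to flag "noise" positions are all assumptions made for this example. They are not the paper's model or its exact metric.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Midpoints of `steps` equal sub-intervals of [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Average the gradient along the straight path from baseline to x.
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    # Scale by the input difference to get per-position attributions.
    return (x - baseline) * avg_grad

# Toy stand-in for a model (hypothetical): f(x) = sum_i w_i * x_i^2.
# Positions with w_i = 0 cannot affect the output, so they act as "noise".
w = np.array([2.0, 0.0, 1.0, 0.0])

def f(x):
    return float(np.sum(w * x ** 2))

def grad_f(x):
    return 2.0 * w * x

x = np.array([1.0, 3.0, -2.0, 0.5])
baseline = np.zeros_like(x)

scores = integrated_gradients(grad_f, x, baseline)
# Flag positions whose attribution is (near) zero; the threshold is arbitrary.
noise_mask = np.abs(scores) < 1e-8
```

In this toy setting the attributions sum to `f(x) - f(baseline)`, which is the completeness property of integrated gradients, and the two positions with zero weight are flagged as noise. A real LCM would need gradients taken with respect to token embeddings, which this sketch does not attempt.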
