
Revisiting Long-context Modeling from Context Denoising Perspective

October 7, 2025
Authors: Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
cs.AI

Abstract

Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for further prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens that can mislead model attention. In this paper, we conduct a fine-grained analysis of contextual noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify noisy information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model achieves performance (50.92) comparable to GPT-4o (51.00).
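The abstract's central tool, scoring context tokens with Integrated Gradients to flag noise, builds on the standard IG attribution formula. The sketch below is a minimal illustration of that formula for a Hugging Face causal LM, assuming a zero-embedding baseline and attribution toward a single target token; the paper's exact IG-score definition, baseline choice, and aggregation may differ.

```python
# Minimal sketch (assumption): standard integrated-gradient token attribution
# for a causal LM, illustrating how an "IG score" over context tokens could be
# computed. This is not necessarily the paper's exact formulation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def integrated_gradient_scores(model, tokenizer, context, target_token, steps=16):
    device = next(model.parameters()).device
    ids = tokenizer(context, return_tensors="pt").input_ids.to(device)
    target_id = tokenizer(target_token, add_special_tokens=False).input_ids[0]

    embed = model.get_input_embeddings()
    x = embed(ids).detach()             # (1, seq_len, hidden)
    baseline = torch.zeros_like(x)      # zero-embedding baseline (an assumption)

    total_grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        interp = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        logits = model(inputs_embeds=interp).logits            # (1, seq_len, vocab)
        # log-probability of the target token at the final position
        logprob = torch.log_softmax(logits[0, -1], dim=-1)[target_id]
        total_grads += torch.autograd.grad(logprob, interp)[0]

    # Riemann approximation of the IG path integral
    ig = (x - baseline) * total_grads / steps
    # Per-token attribution: sum over the embedding dimension
    return ig.sum(dim=-1).squeeze(0)    # (seq_len,)
```

Under this reading, context tokens that receive low attribution for the target prediction would be candidates for the "contextual noise" that CDT is designed to down-weight during training, while high-attribution tokens correspond to the critical tokens whose influence CDT reinforces.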