

Revisiting Long-context Modeling from Context Denoising Perspective

October 7, 2025
作者: Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
cs.AI

Abstract

Long-context models (LCMs) have demonstrated great potential in processing long sequences, facilitating many real-world applications. The success of LCMs can be attributed to their ability to locate implicit critical information within the context for subsequent prediction. However, recent research reveals that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens that can mislead model attention. In this paper, we conduct a fine-grained analysis of context noise and propose an effective metric, the Integrated Gradient (IG) score, to detect and quantify noisy information within the context. Our findings reveal that even simple mitigation of detected context noise can substantially boost the model's attention on critical tokens and benefit subsequent predictions. Building on this insight, we propose Context Denoising Training (CDT), a straightforward yet effective training strategy that improves attention on critical tokens while reinforcing their influence on model predictions. Extensive experiments across four tasks, under both context window scaling and long-context alignment settings, demonstrate the superiority of CDT. Notably, when trained with CDT, an open-source 8B model achieves performance (50.92) comparable to that of GPT-4o (51.00).
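The abstract does not spell out how the paper's IG score is computed, but Integrated Gradients itself is a standard attribution method (Sundararajan et al., 2017): attributions are obtained by integrating the gradient of the model output along a path from a baseline input to the real input. Below is a minimal sketch of per-token IG attribution for a Hugging Face-style causal LM; the function name, the zero-embedding baseline, and the step count are illustrative assumptions, and the paper's exact score definition may differ.

```python
import torch

def integrated_gradient_scores(model, input_ids, target_id, steps=32):
    """Approximate per-token Integrated Gradients scores for one prediction.

    Interpolates the input embeddings between a zero baseline and the real
    embeddings, accumulates gradients of the target token's logit along the
    path, and reduces the attribution to one scalar per context token.
    (Illustrative sketch; not the paper's exact formulation.)
    """
    # Token embeddings of the full context, detached from the graph.
    embed = model.get_input_embeddings()(input_ids).detach()  # (1, seq, dim)
    baseline = torch.zeros_like(embed)                        # zero-embedding baseline

    total_grads = torch.zeros_like(embed)
    for step in range(1, steps + 1):
        alpha = step / steps
        interp = baseline + alpha * (embed - baseline)
        interp.requires_grad_(True)
        logits = model(inputs_embeds=interp).logits           # (1, seq, vocab)
        # Gradient of the target logit at the last position w.r.t. the inputs.
        (grad,) = torch.autograd.grad(logits[0, -1, target_id], interp)
        total_grads += grad

    # Riemann approximation of the IG path integral, summed over the
    # embedding dimension to yield one importance score per token.
    ig = (embed - baseline) * total_grads / steps
    return ig.sum(dim=-1).squeeze(0)                          # (seq,)
```

Under this reading, context tokens with low IG scores contribute little to the target prediction and would be candidates for the "context noise" the paper describes; a training strategy like CDT could then be used to sharpen attention away from such tokens and toward the critical ones.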