LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
April 16, 2026
Authors: Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan, Baobao Chang
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
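The core idea, selectively updating only the weights associated with high-magnitude activations, can be illustrated with a minimal sketch. The paper's actual masking rule and sparsity level are not specified in this abstract, so the `top_frac` threshold, the function names, and the element-wise gradient masking below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def saliency_mask(activations, top_frac=0.05):
    """Boolean mask keeping the top-`top_frac` highest-magnitude activations.

    `top_frac` is a hypothetical hyperparameter standing in for whatever
    saliency threshold LongAct actually uses.
    """
    flat = np.abs(activations).ravel()
    k = max(1, int(len(flat) * top_frac))
    threshold = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return np.abs(activations) >= threshold

def sparse_update(weights, grads, activations, lr=0.1, top_frac=0.05):
    """Apply a gradient step only where the associated activation is salient,
    leaving all other weights untouched (saliency-guided sparse update)."""
    mask = saliency_mask(activations, top_frac)
    return weights - lr * grads * mask

# Toy example: a 4x4 weight matrix with a single "outlier" activation.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
g = np.ones((4, 4))
acts = np.full((4, 4), 0.1)
acts[1, 2] = 50.0  # high-magnitude activation, as observed in Q/K vectors
W_new = sparse_update(W, g, acts, top_frac=0.1)
# Only the weight aligned with the outlier activation moves.
```

Under this sketch, a uniform RL update would touch all 16 entries, whereas the saliency-guided variant modifies only the entry aligned with the outlier activation.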