LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
April 16, 2026
Authors: Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan, Baobao Chang
cs.AI
Abstract
Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.
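The abstract does not spell out LongAct's exact update rule, but the core idea it describes (identify high-magnitude activation channels, then restrict parameter updates to the weights tied to those channels) can be sketched as follows. This is a minimal illustrative guess, not the paper's implementation: the `top_frac` saliency criterion and per-channel magnitude score are assumptions, and a real RL pipeline would apply the mask inside the optimizer step.

```python
import numpy as np

def saliency_mask(activations, top_frac=0.1):
    """Mark the top `top_frac` fraction of channels by mean absolute
    activation magnitude as salient (an assumed criterion, not the
    paper's exact one)."""
    scores = np.abs(activations).mean(axis=0)   # per-channel magnitude
    k = max(1, int(top_frac * scores.size))
    thresh = np.sort(scores)[-k]                # k-th largest score
    return scores >= thresh                     # boolean channel mask

def sparse_update(weight, grad, mask, lr=1e-2):
    """Shift from uniform to saliency-guided sparse updates: apply the
    gradient only to weight columns tied to salient channels, leaving
    all other columns untouched."""
    updated = weight.copy()
    updated[:, mask] -= lr * grad[:, mask]
    return updated

rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 64))
acts[:, :6] *= 10.0                             # a few high-magnitude channels
mask = saliency_mask(acts, top_frac=0.1)        # should pick those channels
W = rng.normal(size=(64, 64))
G = rng.normal(size=(64, 64))
W_new = sparse_update(W, G, mask)
```

Only the columns flagged as salient change after the update; the rest of the weight matrix is bit-identical to before, which is what makes the update sparse.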