Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

February 12, 2026
Authors: Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin
cs.AI

Abstract

Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration, a simple yet effective recipe that explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that our method effectively incentivizes in-context exploration. As a result, it achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
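The abstract only describes the reward design at a high level (a length incentive combined with a redundancy penalty on top of the task reward). The sketch below is a minimal, assumption-based illustration of how such a shaped reward might be computed; it is not the authors' implementation, and the function names, the n-gram redundancy proxy, and the coefficients alpha and beta are hypothetical.

```python
# Hypothetical sketch of a length-incentivized reward with a redundancy penalty.
# All names, the n-gram redundancy measure, and the coefficients are illustrative
# assumptions, not the paper's released code.
from collections import Counter


def ngram_redundancy(tokens: list[str], n: int = 4) -> float:
    """Fraction of repeated n-grams in the trajectory; a simple proxy for redundant text."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(grams)


def shaped_reward(correct: bool, tokens: list[str],
                  max_len: int = 8192,
                  alpha: float = 0.5,    # weight of the length incentive (assumed)
                  beta: float = 1.0) -> float:  # weight of the redundancy penalty (assumed)
    """Task correctness reward + length incentive - redundancy penalty."""
    task_reward = 1.0 if correct else 0.0
    length_bonus = alpha * min(len(tokens) / max_len, 1.0)  # encourage longer trajectories
    redundancy_pen = beta * ngram_redundancy(tokens)        # discourage padding via repetition
    return task_reward + length_bonus - redundancy_pen
```

The two terms reflect the two-step idea stated in the abstract: the length bonus pushes trajectories to be longer (broader state coverage), while the redundancy penalty prevents the model from gaming the bonus with repetitive text.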