Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
February 12, 2026
Authors: Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin
cs.AI
Abstract
Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context.
Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap".
To overcome this limitation, we propose Length-Incentivized Exploration.
This simple yet effective recipe explicitly encourages models to explore more broadly via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-stage manner.
Comprehensive experiments across different models (Qwen3, Llama) demonstrate that Length-Incentivized Exploration effectively incentivizes in-context exploration.
As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
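The "Shallow Exploration Trap" rests on a standard decay argument, sketched here in our own notation (the policy symbol $\pi_\theta$ and the bound $p_{\max}$ are illustrative and not taken from the paper). Under autoregressive generation, the probability of any particular trajectory of length $L$ factorizes token by token, so

$$\Pr(x_{1:L}) = \prod_{t=1}^{L} \pi_\theta(x_t \mid x_{<t}) \le p_{\max}^{L}, \qquad p_{\max} = \max_{t,\,x} \pi_\theta(x \mid x_{<t}) < 1,$$

and the chance of sampling a trajectory long enough for broad state coverage shrinks exponentially in the required length $L$.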
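The abstract names the two reward components (a length-based reward and a redundancy penalty) but not their exact form. The sketch below is one plausible, minimal shaping in Python; the function names, the n-gram redundancy proxy, and the coefficients alpha, beta, and max_len are our assumptions, not details from the paper.

```python
# Minimal sketch of a length-incentivized reward with a redundancy penalty.
# All names and coefficients here are illustrative assumptions, not the paper's.
from collections import Counter


def ngram_repetition_rate(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that are repeats (a simple proxy for redundancy)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def shaped_reward(task_reward: float,
                  tokens: list[str],
                  max_len: int = 8192,
                  alpha: float = 0.5,
                  beta: float = 1.0) -> float:
    """Task reward plus a capped length bonus, minus a redundancy penalty."""
    length_bonus = min(len(tokens), max_len) / max_len  # grows with trajectory length, capped at 1
    redundancy = ngram_repetition_rate(tokens)          # in [0, 1]
    return task_reward + alpha * length_bonus - beta * redundancy
```

In an RL loop, such a shaped reward would replace the raw task reward when computing advantages; the redundancy term is there so that the length bonus cannot be gamed by padding the trajectory with repeated text.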