

Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

September 28, 2025
作者: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
cs.AI

Abstract

A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
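The abstract does not spell out how ER, ERV, and ERA are computed, so the following is a minimal sketch under stated assumptions: it uses the standard Roy–Vetterli definition of effective rank (the exponential of the entropy of the normalized singular-value distribution) applied to a matrix of hidden states, and approximates ERV and ERA with simple first- and second-order finite differences across training steps. The function names and the finite-difference choice are hypothetical illustrations, not the paper's exact formulation.

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Effective Rank (ER) of a hidden-state matrix.

    hidden_states: (num_tokens, hidden_dim) array of decoder hidden states.
    Returns exp(H), where H is the Shannon entropy of the normalized
    singular-value distribution (Roy & Vetterli's effective rank).
    """
    s = np.linalg.svd(hidden_states, compute_uv=False)  # singular values
    p = s / (s.sum() + eps)                              # normalize to a distribution
    entropy = -np.sum(p * np.log(p + eps))               # Shannon entropy
    return float(np.exp(entropy))

def er_dynamics(er_per_step: list[float]) -> tuple[np.ndarray, np.ndarray]:
    """Finite-difference proxies for ER Velocity (ERV) and ER Acceleration (ERA)
    from a sequence of ER values logged across training steps."""
    er = np.asarray(er_per_step, dtype=float)
    erv = np.diff(er, n=1)  # first-order difference ~ ERV (exploitation dynamics)
    era = np.diff(er, n=2)  # second-order difference ~ ERA (used as meta-controller signal)
    return erv, era
```

In this reading, ER tracks how much of the hidden-state space a policy spans (exploration), while its velocity and acceleration summarize how that span is changing (exploitation dynamics); VERL is described as using the ERA-like signal to modulate the RL advantage, though the exact shaping rule is not given in the abstract.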