

Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

September 28, 2025
作者: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
cs.AI

Abstract

A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
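The abstract does not spell out how ER, ERV, and ERA are computed, so the following is a minimal sketch under stated assumptions: it uses the standard Roy–Vetterli definition of effective rank (the exponential of the entropy of the normalized singular-value distribution) applied to a matrix of hidden states, and approximates ERV and ERA with simple first- and second-order finite differences across training steps. The function names and the finite-difference choice are hypothetical illustrations, not the paper's exact formulation.

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Effective Rank (ER) of a hidden-state matrix.

    hidden_states: (num_tokens, hidden_dim) array of decoder hidden states.
    Returns exp(H), where H is the Shannon entropy of the normalized
    singular-value distribution (Roy & Vetterli's effective rank).
    """
    s = np.linalg.svd(hidden_states, compute_uv=False)  # singular values
    p = s / (s.sum() + eps)                              # normalize to a distribution
    entropy = -np.sum(p * np.log(p + eps))               # Shannon entropy
    return float(np.exp(entropy))

def er_dynamics(er_per_step: list[float]) -> tuple[np.ndarray, np.ndarray]:
    """Finite-difference proxies for ER Velocity (ERV) and ER Acceleration (ERA)
    from a sequence of ER values logged across training steps."""
    er = np.asarray(er_per_step, dtype=float)
    erv = np.diff(er, n=1)  # first-order difference ~ ERV (exploitation dynamics)
    era = np.diff(er, n=2)  # second-order difference ~ ERA (used as meta-controller signal)
    return erv, era
```

In this reading, ER tracks how much of the hidden-state space a policy spans (exploration), while its velocity and acceleration summarize how that span is changing (exploitation dynamics); VERL is described as using the ERA-like signal to modulate the RL advantage, though the exact shaping rule is not given in the abstract.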