
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

September 28, 2025
作者: Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang
cs.AI

Abstract

A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4). This finding points to an opportunity to enhance both capacities simultaneously and motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.
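
The abstract describes the hidden-state quantities (ER, ERV, ERA) and the advantage shaping only at a high level, so the following is a minimal sketch of one way they could be computed. It assumes the common entropy-of-singular-values definition of effective rank, simple finite differences for the velocity and acceleration terms, and an illustrative gating rule with hypothetical coefficients `alpha` and `beta`; the paper's exact estimators and shaping rule may differ.

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Effective Rank (ER) of a (tokens x dim) matrix of hidden states.

    Assumes the entropy-based definition ER = exp(H(p)), where p are the
    singular values normalized to sum to one; the paper may normalize differently.
    """
    s = np.linalg.svd(hidden_states, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

def er_velocity_acceleration(er_trace: list[float]) -> tuple[float, float]:
    """Finite-difference estimates of ERV (first derivative) and ERA
    (second derivative) from a per-step trace of ER values.

    This is one plausible discretization; the paper's estimator
    (smoothing, per-layer aggregation, etc.) may differ.
    """
    if len(er_trace) < 3:
        return 0.0, 0.0
    erv = er_trace[-1] - er_trace[-2]                     # first difference
    era = er_trace[-1] - 2 * er_trace[-2] + er_trace[-3]  # second difference
    return erv, era

def shaped_advantage(advantage: float, er: float, erv: float, era: float,
                     alpha: float = 0.1, beta: float = 0.1) -> float:
    """Illustrative dual-channel advantage shaping in the spirit of VERL.

    The gate on ERA and the coefficients alpha/beta are assumptions, not the
    paper's exact rule: when ERA signals that exploitation is stalling
    (era <= 0), amplify the exploration channel (ER); otherwise reinforce
    the exploitation channel (ERV).
    """
    if era <= 0:
        return advantage + alpha * er          # exploration bonus to preempt overconfidence
    return advantage + beta * max(erv, 0.0)    # consolidate exploitative gains
```

In this sketch, ERA acts as the forward-looking signal the abstract calls a "predictive meta-controller": a non-positive acceleration is treated as a warning that exploitation is saturating, so the exploration channel is amplified rather than traded off against it.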