
Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

September 29, 2025
Authors: Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, Ling Pan
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration, alternating between evaluating the current policy's value and improving the policy based on that evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than the general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER), a minimalist yet highly effective RL method that translates this principle into a practical and scalable algorithm for LLM math reasoning: it samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
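
The decision rule the abstract describes, choosing actions from a softmax over the Q-values of a fixed uniformly random policy, can be illustrated on a toy problem. The sketch below is not the paper's implementation: it uses a tiny hand-built deterministic tree with a single rewarded leaf, and the names and hyperparameters (HORIZON, GOOD_LEAF, num_rollouts, temperature) are illustrative assumptions. It estimates the uniform-policy Q-function by Monte Carlo rollouts and then samples each action from softmax(Q / temperature).

```python
import math
import random

# Toy illustration (not the paper's implementation): a small deterministic,
# tree-structured MDP with binary terminal rewards, mirroring the structure
# the abstract ascribes to RLVR math reasoning. Q-values of a fixed uniform
# random policy are estimated by Monte Carlo rollouts, and actions are then
# sampled from a softmax over those Q-values. All names and hyperparameters
# here are illustrative assumptions.

ACTIONS = [0, 1]          # two branches per step
HORIZON = 4               # finite horizon (tree depth)
GOOD_LEAF = (1, 0, 1, 1)  # the single action sequence that earns reward 1


def step(state, action):
    """Deterministic, tree-structured transition: append the action."""
    return state + (action,)


def terminal_reward(state):
    """Binary verifiable reward at the leaves."""
    return 1.0 if state == GOOD_LEAF else 0.0


def rollout_uniform(state):
    """Complete the trajectory with a uniformly random policy."""
    while len(state) < HORIZON:
        state = step(state, random.choice(ACTIONS))
    return terminal_reward(state)


def estimate_q_uniform(state, num_rollouts=200):
    """Monte Carlo estimate of Q under the uniform policy for each action."""
    q = {}
    for a in ACTIONS:
        nxt = step(state, a)
        returns = [rollout_uniform(nxt) for _ in range(num_rollouts)]
        q[a] = sum(returns) / num_rollouts
    return q


def softmax_sample(q, temperature=0.05):
    """Sample an action from softmax(Q / temperature)."""
    logits = [q[a] / temperature for a in ACTIONS]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(ACTIONS, weights=probs)[0]


if __name__ == "__main__":
    state = ()
    while len(state) < HORIZON:
        q = estimate_q_uniform(state)
        state = step(state, softmax_sample(q))
    print("chosen sequence:", state, "reward:", terminal_reward(state))
```

In the paper's LLM setting the uniform-policy Q-values are learned at scale rather than estimated by exhaustive per-state rollouts; the sketch only shows why valuing a fixed random policy can already single out the rewarded branch in a deterministic, tree-structured problem with binary terminal rewards.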