
Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

September 29, 2025
Authors: Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, Ling Pan
cs.AI

Abstract

RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
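The abstract's central claim is that, in a tree-structured MDP with deterministic transitions and binary terminal rewards, the greedy action with respect to the Q-function of a fixed uniformly random policy is already optimal, and that sampling from a softmax over those Q-values keeps multiple valid paths alive. The Python sketch below is not the authors' implementation; the two-token vocabulary, the horizon of 3, the toy reward rule, and the names q_uniform and rover_policy are assumptions made purely to illustrate the idea on a miniature example.

# Minimal sketch (not the paper's code): uniform-policy Q-values on a toy
# tree-structured MDP with deterministic transitions and binary terminal rewards.
# Assumptions: a 2-token vocabulary, horizon 3, and reward 1 for any sequence
# containing at least two '1' tokens (a stand-in for a verifiable-reward check).

import math
import random
from functools import lru_cache

VOCAB = ("0", "1")   # toy action space (tokens)
HORIZON = 3          # fixed generation length
TAU = 0.1            # softmax temperature for ROVER-style sampling

def terminal_reward(seq):
    """Binary verifiable reward: 1 if the finished sequence is 'correct'."""
    return 1.0 if seq.count("1") >= 2 else 0.0

@lru_cache(maxsize=None)
def q_uniform(prefix, action):
    """Q-value of taking `action` at state `prefix`, then following the
    fixed uniform random policy until the horizon."""
    nxt = prefix + action
    if len(nxt) == HORIZON:                      # terminal state
        return terminal_reward(nxt)
    # Uniform policy: the future return is the average over next actions.
    return sum(q_uniform(nxt, a) for a in VOCAB) / len(VOCAB)

def rover_policy(prefix):
    """Sample an action from a softmax over uniform-policy Q-values."""
    qs = [q_uniform(prefix, a) for a in VOCAB]
    weights = [math.exp(q / TAU) for q in qs]
    return random.choices(VOCAB, weights=weights)[0]

if __name__ == "__main__":
    # The greedy action under the uniform-policy Q-values at the root is '1',
    # the action that keeps more verifiably correct completions reachable.
    print({a: round(q_uniform("", a), 3) for a in VOCAB})
    # Softmax sampling keeps multiple correct completions in play (diversity).
    rollouts = set()
    for _ in range(200):
        seq = ""
        while len(seq) < HORIZON:
            seq += rover_policy(seq)
        rollouts.add(seq)
    print(sorted(rollouts))

In this toy setting the uniform-policy Q-value at the root is higher for the action leading to more correct completions, so acting greedily on it recovers an optimal action without any policy-iteration loop, while the softmax sampler produces several distinct correct sequences rather than collapsing onto a single one.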