Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks such as PPO and GRPO, which follow generalized policy iteration: they alternate between evaluating the current policy's value and improving the policy based on that evaluation. While effective, these methods often suffer from training instability and diversity collapse, and they require complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov decision process (MDP) with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, this structure is simpler than the general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. To translate this principle into a practical and scalable algorithm for LLM math reasoning, we introduce Random Policy Valuation for Diverse Reasoning (ROVER), a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
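The central claim can be illustrated on a toy problem. The sketch below is not the ROVER implementation; it assumes a small hand-written tree (the `TREE` dictionary, its node names, and all function names are hypothetical) with deterministic transitions and binary terminal rewards, and computes the Q-values of the uniform random policy exactly by recursion. In such a tree, a child has positive uniform-policy value only if its subtree contains a correct leaf, so acting greedily on these Q-values reaches a correct leaf whenever one exists, while sampling from a softmax over them keeps every action reachable.

```python
import math
import random

# Hypothetical toy tree (an assumption for illustration, not from the paper):
# internal nodes map to a list of children; leaves are binary terminal
# rewards (1 = verified correct, 0 = incorrect).
TREE = {
    "root": ["a", "b", "c"],
    "a": [0, 0],
    "b": ["b1", 0],
    "b1": [1, 0],
    "c": [1, 1],
}


def uniform_value(node):
    """Expected terminal reward when every action is picked uniformly at random."""
    if isinstance(node, int):  # leaf: the value is the reward itself
        return float(node)
    children = TREE[node]
    return sum(uniform_value(child) for child in children) / len(children)


def uniform_q(state):
    """Uniform-policy Q-values; transitions are deterministic, so Q(s, a) = V(child)."""
    return [uniform_value(child) for child in TREE[state]]


def greedy_action(state):
    """Index of the action with the largest uniform-policy Q-value."""
    q = uniform_q(state)
    return max(range(len(q)), key=q.__getitem__)


def softmax_sample(state, temperature=1.0, rng=random):
    """Sample an action index from a softmax over the uniform-policy Q-values."""
    q = uniform_q(state)
    weights = [math.exp(v / temperature) for v in q]
    return rng.choices(range(len(q)), weights=weights, k=1)[0]


def rollout_greedy(state="root"):
    """Follow the greedy action until a leaf is reached; return its reward."""
    node = state
    while not isinstance(node, int):
        node = TREE[node][greedy_action(node)]
    return node
```

Here `uniform_q("root")` assigns a value of zero to the dead-end branch `a` and positive values to `b` and `c`, and `rollout_greedy()` ends at a rewarded leaf. The `temperature` parameter is likewise an assumption of this sketch: lowering it concentrates sampling on high-value actions, and raising it spreads probability across branches. At LLM scale the Q-values cannot be enumerated this way and would have to be learned.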
