First Return, Entropy-Eliciting Explore

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts from those points to construct semantically grounded intermediate feedback. The method provides targeted guidance without relying on dense supervision. Empirical results on a mathematical reasoning benchmark (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
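The abstract does not spell out how high-uncertainty decision points are located. As a minimal sketch, assuming uncertainty is measured by the Shannon entropy of the next-token distribution at each step and that the top `k` steps are chosen as branch points for targeted rollouts, the selection might look like the following. The function names `token_entropy` and `high_entropy_positions`, the top-`k` rule, and the toy distributions are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability tokens contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k highest-entropy steps, in trajectory order.

    Assumption: these indices stand in for the "high-uncertainty decision
    points" from which targeted rollouts would be launched.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy trajectory: one next-token distribution per generation step (made-up values).
trajectory = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.90, 0.05, 0.03, 0.02],  # fairly confident step
    [0.40, 0.30, 0.20, 0.10],  # uncertain step
]
branch_points = high_entropy_positions(trajectory, k=2)
print(branch_points)
```

In a real training loop the distributions would come from the policy model's softmax outputs, and each selected position would seed several partial rollouts whose verifiable outcomes could serve as intermediate feedback.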