First Return, Entropy-Eliciting Explore
July 9, 2025
Authors: Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, Zejun Ma
cs.AI
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning
abilities of Large Language Models (LLMs), but it struggles with unstable
exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a
structured exploration framework that identifies high-uncertainty decision
points in reasoning trajectories and performs targeted rollouts to construct
semantically grounded intermediate feedback. Our method provides targeted
guidance without relying on dense supervision. Empirical results on
mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable
training, produces longer and more coherent responses, and increases the
proportion of fully correct trajectories. These results highlight the
framework's effectiveness in improving LLM reasoning through more robust and
structured exploration.
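
The abstract does not give implementation details, but the procedure it describes (scoring token-level uncertainty along a sampled reasoning trajectory, branching extra rollouts from the most uncertain positions, and scoring each prefix by the verified success rate of those rollouts) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the top-k entropy rule, the rollout count, and the helper names (`token_entropies`, `select_branch_points`, `prefix_feedback`, `rollout_fn`, `verify_fn`) are all hypothetical.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Per-token entropy of the next-token distribution along a trajectory.

    logits: shape (T, V), one row of vocabulary logits per generated token.
    """
    z = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # softmax per step
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)       # shape (T,)

def select_branch_points(entropies: np.ndarray, k: int = 4) -> list:
    """Pick the k highest-entropy positions as candidate decision points
    (an assumed top-k rule; the paper may use a different criterion)."""
    return sorted(int(i) for i in np.argsort(entropies)[-k:])

def prefix_feedback(prefix_tokens, rollout_fn, verify_fn, n_rollouts: int = 8) -> float:
    """Intermediate feedback for a prefix: the fraction of targeted rollouts
    from that prefix that the verifier accepts.

    rollout_fn(prefix) -> completed trajectory (hypothetical sampler);
    verify_fn(trajectory) -> bool, the verifiable reward (e.g. answer check).
    """
    wins = sum(bool(verify_fn(rollout_fn(prefix_tokens))) for _ in range(n_rollouts))
    return wins / n_rollouts
```

One plausible use of these per-prefix success rates is as the "semantically grounded intermediate feedback" mentioned above: they give a value signal at each high-uncertainty decision point without requiring dense step-level supervision.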