

First Return, Entropy-Eliciting Explore

July 9, 2025
Authors: Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, Zejun Ma
cs.AI

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
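The abstract only outlines the mechanism at a high level: locate high-uncertainty (high-entropy) decision points along a sampled reasoning trajectory, then branch targeted rollouts from those points and use their verifiable outcomes as intermediate feedback. The sketch below is an illustrative reading of that idea, not the paper's implementation; the `sample_fn` sampler, the rollout objects' `reward` field, the top-k entropy selection rule, and the NumPy representation of per-step logits are all assumptions made for the example.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Per-step entropy of the next-token distribution.
    logits: shape (T, V), one row of vocabulary logits per generated token."""
    z = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def select_branch_points(logits: np.ndarray, k: int = 4) -> list[int]:
    """Pick the k highest-entropy positions as candidate branch points."""
    ent = token_entropies(logits)
    return sorted(np.argsort(ent)[-k:].tolist())

def targeted_rollouts(prompt_tokens, trajectory_tokens, logits, sample_fn,
                      n_rollouts: int = 8, k: int = 4) -> dict[int, float]:
    """For each high-entropy position, re-sample continuations from the shared prefix.
    `sample_fn(prefix, n)` is a hypothetical policy sampler returning rollout objects
    with a verifiable 0/1 `reward`; the empirical success rate at each branch point
    serves as intermediate feedback for that decision."""
    feedback = {}
    for pos in select_branch_points(logits, k):
        prefix = list(prompt_tokens) + list(trajectory_tokens[:pos])
        rollouts = sample_fn(prefix, n_rollouts)              # continuations from this prefix
        rewards = [r.reward for r in rollouts]                # verifiable outcome rewards
        feedback[pos] = sum(rewards) / len(rewards)           # value estimate at the branch point
    return feedback
```

Under these assumptions, the per-position success rates could then be used as the semantically grounded intermediate signal the abstract describes, rather than relying on a dense learned reward model.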