First Return, Entropy-Eliciting Explore
July 9, 2025
Authors: Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, Zejun Ma
cs.AI
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning
abilities of Large Language Models (LLMs), but it struggles with unstable
exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a
structured exploration framework that identifies high-uncertainty decision
points in reasoning trajectories and performs targeted rollouts to construct
semantically grounded intermediate feedback. Our method provides targeted
guidance without relying on dense supervision. Empirical results on
mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable
training, produces longer and more coherent responses, and increases the
proportion of fully correct trajectories. These results highlight the
framework's effectiveness in improving LLM reasoning through more robust and
structured exploration.
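
The abstract does not give implementation details, but the procedure it describes (scoring token-level uncertainty along a sampled reasoning trajectory, branching extra rollouts from the most uncertain positions, and scoring each prefix by the verified success rate of those rollouts) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the top-k entropy rule, the rollout count, and the helper names (`token_entropies`, `select_branch_points`, `prefix_feedback`, `rollout_fn`, `verify_fn`) are all hypothetical.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Per-token entropy of the next-token distribution along a trajectory.

    logits: shape (T, V), one row of vocabulary logits per generated token.
    """
    z = logits - logits.max(axis=-1, keepdims=True)            # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # softmax per step
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)       # shape (T,)

def select_branch_points(entropies: np.ndarray, k: int = 4) -> list:
    """Pick the k highest-entropy positions as candidate decision points
    (an assumed top-k rule; the paper may use a different criterion)."""
    return sorted(int(i) for i in np.argsort(entropies)[-k:])

def prefix_feedback(prefix_tokens, rollout_fn, verify_fn, n_rollouts: int = 8) -> float:
    """Intermediate feedback for a prefix: the fraction of targeted rollouts
    from that prefix that the verifier accepts.

    rollout_fn(prefix) -> completed trajectory (hypothetical sampler);
    verify_fn(trajectory) -> bool, the verifiable reward (e.g. answer check).
    """
    wins = sum(bool(verify_fn(rollout_fn(prefix_tokens))) for _ in range(n_rollouts))
    return wins / n_rollouts
```

One plausible use of these per-prefix success rates is as the "semantically grounded intermediate feedback" mentioned above: they give a value signal at each high-uncertainty decision point without requiring dense step-level supervision.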