

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

March 4, 2026
作者: Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
cs.AI

Abstract

Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF draws on two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with up to 2.2× better sample efficiency than RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
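The abstract's core loop can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the authors' implementation: the `Attempt` dataclass, `build_refinement_prompt`, `inject_scaffolds`, and the `sparse_threshold` heuristic are all hypothetical names chosen to mirror the two ideas described above, aggregating group-level NL feedback into a refinement prompt, and injecting refined trajectories as off-policy scaffolds only when the on-policy group lacks reward signal.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    """One rollout in a sampled group: the response, its scalar reward,
    and the external NL critique attached to it (hypothetical schema)."""
    response: str
    reward: float
    critique: str

def build_refinement_prompt(question: str, group: list) -> str:
    """Aggregate group-level feedback -- external critiques plus sibling
    attempts with their partial ideas and failure patterns -- into a
    single prompt asking the model for an improved answer."""
    parts = [f"Question: {question}", "Previous attempts and feedback:"]
    for i, a in enumerate(group):
        parts.append(f"[Attempt {i}] {a.response}")
        parts.append(f"[Critique {i}] {a.critique}")
    parts.append("Using the critiques and partial ideas above, "
                 "produce an improved answer.")
    return "\n".join(parts)

def inject_scaffolds(group: list, refinements: list,
                     sparse_threshold: float = 0.0) -> list:
    """Adaptive injection: mix refined (off-policy) trajectories into the
    training batch only when the on-policy group sits in a sparse-reward
    region, i.e. no attempt exceeds the threshold."""
    if max(a.reward for a in group) > sparse_threshold:
        return group  # enough on-policy signal; train as usual
    return group + refinements  # sparse rewards: add scaffolds
```

In this sketch the refinements would themselves be scored and trained on in the same RL loop, which is how the abstract's "virtuous cycle" between generation and refinement would arise.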