Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
March 4, 2026
Authors: Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
cs.AI
Abstract
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is aggregated into high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, yielding a 2.2× improvement in sample efficiency over RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
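To make the rollout-level mechanics concrete, the following is a minimal Python sketch of one GOLF-style sampling step as the abstract describes it: sample a group of attempts, score them with a scalar reward, and, when the group falls in a sparse-reward region, aggregate external critiques and intra-group attempts into a refinement that is injected into the batch as an off-policy scaffold. All function names (`golf_batch`, `generate`, `critique`, `refine`, `reward`) and the all-zero-reward trigger are illustrative assumptions, not the authors' implementation; the released code at the repository above is authoritative.

```python
def golf_batch(prompt, generate, critique, refine, reward, group_size=4):
    """One GOLF-style rollout step (hypothetical sketch, not the authors' code).

    1. Sample a group of on-policy attempts for the prompt.
    2. Score each attempt with the scalar reward.
    3. If no attempt succeeds (a sparse-reward region, an assumed trigger),
       aggregate (i) external critiques and (ii) the group's own attempts
       into a refinement and inject it as an off-policy scaffold.
    """
    attempts = [generate(prompt) for _ in range(group_size)]
    rewards = [reward(prompt, a) for a in attempts]
    batch = [{"response": a, "reward": r, "off_policy": False}
             for a, r in zip(attempts, rewards)]

    if max(rewards) == 0:  # sparse-reward region: the whole group failed
        # (i) external critiques pinpointing errors in each failed attempt
        critiques = [critique(prompt, a) for a in attempts]
        # (ii) intra-group attempts supply alternative partial ideas;
        # both feedback sources are aggregated into one refinement
        refinement = refine(prompt, attempts, critiques)
        batch.append({"response": refinement,
                      "reward": reward(prompt, refinement),
                      "off_policy": True})  # scaffold for targeted guidance
    return batch


# Toy demonstration with stub components (all hypothetical):
gen = lambda p: "wrong answer"
crit = lambda p, a: f"attempt '{a}' does not address: {p}"
ref = lambda p, attempts, critiques: "refined answer"
rew = lambda p, a: 1 if a == "refined answer" else 0

batch = golf_batch("What is 2+2?", gen, crit, ref, rew, group_size=4)
```

In this toy run every on-policy attempt earns zero reward, so one off-policy refinement is appended, giving five batch entries; in the paper's unified RL loop, both the generation policy and the refinement capability would then be updated on such batches.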