No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

January 11, 2026
Authors: Zhicong Li, Lingjie Jiang, Yulan Hu, Xingchen Zeng, Yixia Li, Xiangwen Zhang, Guanhua Chen, Zheng Pan, Xin Li, Yong Liu
cs.AI

Abstract

Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and provide feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO uses a cascaded rollout mechanism in which the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
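
To make the described loop concrete, below is a minimal Python sketch of one ECHO-style training step as the abstract outlines it: a cascaded rollout in which the critic produces several diagnoses of an initial trajectory, the policy retries with each diagnosis, and both models receive group-normalized (GRPO-style) advantages, with the critic's reward shaped by a saturation-aware gain term. The function names (echo_step, saturation_aware_gain), the saturation threshold, and the boost factor are illustrative assumptions, not the paper's implementation.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style group-structured advantages: normalize each reward
    against the mean and standard deviation of its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]


def saturation_aware_gain(r_initial, r_refined, saturation=0.9, boost=2.0):
    """Assumed saturation-aware gain shaping: the critic is rewarded for the
    improvement it induces, with extra weight when the initial trajectory is
    already near the reward ceiling (a learning plateau)."""
    gain = r_refined - r_initial
    if r_initial >= saturation:
        gain *= boost  # value incremental gains on already high-performing rollouts
    return gain


def echo_step(policy_rollout, critic_diagnose, policy_refine, env_reward, k=4):
    """One cascaded rollout: initial trajectory -> k critic diagnoses ->
    k refined trajectories -> dual-track (policy and critic) group advantages."""
    tau_0 = policy_rollout()
    r_0 = env_reward(tau_0)

    policy_rewards, critic_rewards, refined = [], [], []
    for _ in range(k):
        diagnosis = critic_diagnose(tau_0)       # natural-language feedback
        tau_1 = policy_refine(tau_0, diagnosis)  # policy retries using the feedback
        r_1 = env_reward(tau_1)
        refined.append((diagnosis, tau_1))
        policy_rewards.append(r_1)
        critic_rewards.append(saturation_aware_gain(r_0, r_1))

    # Dual-track GRPO: each model gets advantages normalized within its own group,
    # so critic updates stay synchronized with the same batch of policy rollouts.
    return {
        "policy_advantages": group_advantages(policy_rewards),
        "critic_advantages": group_advantages(critic_rewards),
        "refined": refined,
    }


if __name__ == "__main__":
    random.seed(0)
    out = echo_step(
        policy_rollout=lambda: "initial trajectory",
        critic_diagnose=lambda tau: f"diagnosis of {tau}",
        policy_refine=lambda tau, d: f"{tau} refined with '{d}'",
        env_reward=lambda tau: random.random(),  # stand-in for the environment's return
    )
    print(out["policy_advantages"])
    print(out["critic_advantages"])
```

In the actual method the two advantage tracks would feed GRPO gradient updates for the policy and critic LLMs; here they are simply returned so the sketch stays self-contained.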