No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning
January 11, 2026
Authors: Zhicong Li, Lingjie Jiang, Yulan Hu, Xingchen Zeng, Yixia Li, Xiangwen Zhang, Guanhua Chen, Zheng Pan, Xin Li, Yong Liu
cs.AI
Abstract
Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, so a stationary critic grows stale and its feedback yields diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and the critic in a synchronized co-evolutionary loop. ECHO uses a cascaded rollout mechanism in which the critic generates multiple diagnoses for an initial trajectory, followed by policy refinements that enable group-structured advantage estimation. To counter learning plateaus, ECHO employs a saturation-aware gain-shaping objective that rewards the critic for inducing incremental improvements in high-performing trajectories. Through dual-track GRPO updates, the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
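To make the co-evolutionary loop described above concrete, the sketch below outlines one training step in the spirit of ECHO: a cascaded rollout, multiple critic diagnoses, per-diagnosis policy refinements scored into a group for GRPO-style normalized advantages, and separate updates for policy and critic. This is a minimal illustration, not the authors' implementation; the interfaces (`env.sample_task`, `env.score`, `policy.rollout`, `policy.refine`, `critic.diagnose`, and the exact form of the saturation-aware gain) are assumptions introduced here for clarity.

```python
"""Schematic sketch of an ECHO-style co-evolution step (illustrative only).

Assumed, hypothetical interfaces:
  env.sample_task() -> task
  env.score(trajectory) -> float in [0, 1]
  policy.rollout(task) / policy.refine(task, traj, critique) -> trajectory
  critic.diagnose(task, trajectory) -> natural-language critique
  policy.update(advantages) / critic.update(advantages) -> None
"""
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    # GRPO-style group-normalized advantage: (r - mean) / std over the group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def echo_style_step(env, policy, critic, num_diagnoses=4, saturation_eps=0.1):
    # 1) Cascaded rollout: sample an initial trajectory from the current policy.
    task = env.sample_task()
    initial_traj = policy.rollout(task)
    r_init = env.score(initial_traj)

    # 2) The critic produces several diagnoses of the same initial trajectory.
    diagnoses = [critic.diagnose(task, initial_traj) for _ in range(num_diagnoses)]

    refined_rewards, critic_gains = [], []
    for diag in diagnoses:
        # 3) Policy refinement conditioned on each critique, yielding a group
        #    of refined trajectories for group-structured advantage estimation.
        refined_traj = policy.refine(task, initial_traj, diag)
        r_ref = env.score(refined_traj)
        refined_rewards.append(r_ref)

        # 4) Saturation-aware gain (an illustrative form, not the paper's exact
        #    objective): relative improvement over the remaining headroom, so
        #    small absolute gains on already high-scoring trajectories still
        #    reward the critic instead of plateauing.
        critic_gains.append((r_ref - r_init) / (1.0 - r_init + saturation_eps))

    # 5) Dual-track updates: the policy learns from refined-trajectory
    #    advantages; the critic learns from the gains its diagnoses induced,
    #    keeping its feedback synchronized with the evolving policy.
    policy.update(group_advantages(refined_rewards))
    critic.update(group_advantages(critic_gains))
```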