RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback
March 9, 2026
Authors: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao
cs.AI
Abstract
Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
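As a rough illustration of the two feedback signals, the sketch below shows (a) a reward for the *increment* in subtask completion over the best prior attempt, and (b) a UCB-style retrieval score blending similarity and utility with an exploration bonus. The scoring form, the weights `alpha` and `c`, and the toy memory buffer are all assumptions for illustration, not the paper's actual formulation.

```python
import math

def intrinsic_numerical_reward(completed_now: float, best_so_far: float) -> float:
    """Reward only the incremental subtask completion relative to the
    best prior attempt (hypothetical shaping; clipped at zero)."""
    return max(completed_now - best_so_far, 0.0)

def simutil_ucb_score(similarity: float, utility: float,
                      times_retrieved: int, total_retrievals: int,
                      alpha: float = 0.5, c: float = 1.0) -> float:
    """Hypothetical SimUtil-UCB score: blend relevance (similarity to the
    current state) with utility (how helpful the lesson proved before),
    plus a UCB exploration bonus favoring rarely retrieved lessons."""
    exploit = alpha * similarity + (1.0 - alpha) * utility
    bonus = c * math.sqrt(math.log(total_retrievals + 1) / (times_retrieved + 1))
    return exploit + bonus

# Toy memory buffer: a frequently retrieved, highly similar lesson vs. a
# rarely retrieved, high-utility one. The exploration bonus lets the
# rarely used lesson win despite lower similarity.
memory = [
    {"lesson": "check the fridge before the microwave", "sim": 0.9, "util": 0.4, "n": 10},
    {"lesson": "open drawers before searching shelves", "sim": 0.6, "util": 0.8, "n": 1},
]
total = sum(m["n"] for m in memory)
best = max(memory, key=lambda m: simutil_ucb_score(m["sim"], m["util"], m["n"], total))
print(best["lesson"])  # the rarely retrieved lesson is selected
```

With `alpha = 0.5` both entries have similar exploitation scores, so the retrieval choice is driven by the exploration bonus, which is exactly the behavior the abstract attributes to balancing relevance, utility, and exploration.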