

RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

March 9, 2026
作者: Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao
cs.AI

Abstract

Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit in parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising exploration, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy, which balances relevance, utility, and exploration to leverage past experiences effectively. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results -- e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper -- while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
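The abstract names two concrete mechanisms: a progress-based intrinsic numerical reward and SimUtil-UCB retrieval over a memory buffer. A minimal, hypothetical sketch of what such scoring could look like follows; the score form, the weights `alpha`, `beta`, `c`, and the helper names `progress_reward`, `simutil_ucb_score`, and `retrieve` are illustrative assumptions, not details taken from the paper:

```python
import math

def progress_reward(completion_now, best_so_far):
    """Illustrative intrinsic numerical feedback: reward only net subtask
    progress beyond the best prior attempt, so promising exploration that
    advances further than before receives a positive signal."""
    return max(0.0, completion_now - best_so_far)

def simutil_ucb_score(similarity, utility, pulls, total_pulls,
                      alpha=0.5, beta=0.5, c=1.0):
    """Score one memory entry for retrieval (hypothetical UCB-style form).

    similarity:  relevance of the entry to the current situation (0..1)
    utility:     how helpful the entry proved when used before (0..1)
    pulls:       times this entry has been retrieved so far
    total_pulls: total retrievals across all entries
    alpha/beta/c weight relevance, utility, and exploration; the values
    here are placeholders, not the paper's.
    """
    exploit = alpha * similarity + beta * utility
    # Standard UCB exploration bonus: rarely retrieved entries score higher.
    explore = c * math.sqrt(math.log(total_pulls + 1) / (pulls + 1))
    return exploit + explore

def retrieve(memory, query_sim, k=3, **kw):
    """Return indices of the top-k memory entries by SimUtil-UCB score."""
    total = sum(e["pulls"] for e in memory)
    ranked = sorted(
        range(len(memory)),
        key=lambda i: simutil_ucb_score(query_sim[i], memory[i]["utility"],
                                        memory[i]["pulls"], total, **kw),
        reverse=True,
    )
    return ranked[:k]
```

Under this sketch, an entry that is equally relevant and useful but rarely retrieved outranks a heavily reused one, which is the exploration behavior the abstract attributes to SimUtil-UCB.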