

Agent Learning via Early Experience

October 9, 2025
作者: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
cs.AI

Abstract

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
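The two strategies described in the abstract can be made concrete with a small illustration. Below is a minimal Python sketch of how rollouts collected under the early-experience paradigm might be turned into supervised training examples for each strategy. The `Transition` fields and the `expert_action` and `reflect` helpers are assumptions made for illustration, not the paper's actual data format or interface.

```python
# A minimal sketch of constructing "early experience" training data from agent
# rollouts, assuming a simple textual state/action format. All names below are
# illustrative placeholders, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transition:
    state: str          # textual observation before the agent acts
    action: str         # action proposed by the current policy
    next_state: str     # observation returned by the environment


def world_modeling_examples(transitions: List[Transition]) -> List[dict]:
    """Implicit world modeling: train the model to predict the future state
    that follows its own action, grounding the policy in environment dynamics."""
    return [
        {
            "prompt": f"State:\n{t.state}\n\nAction:\n{t.action}\n\nPredict the next state:",
            "target": t.next_state,
        }
        for t in transitions
    ]


def self_reflection_examples(
    transitions: List[Transition],
    expert_action: Callable[[str], str],
    reflect: Callable[[str, str, str, str], str],
) -> List[dict]:
    """Self-reflection: contrast the agent's own (possibly suboptimal) action
    with a reference action and learn from a generated rationale.
    `expert_action` and `reflect` are hypothetical helpers, e.g. a lookup into
    expert demonstrations and an LLM prompt that explains the difference."""
    examples = []
    for t in transitions:
        reference = expert_action(t.state)
        rationale = reflect(t.state, t.action, t.next_state, reference)
        examples.append(
            {
                "prompt": f"State:\n{t.state}\n\nWhich action should be taken, and why?",
                "target": f"{rationale}\n\nAction: {reference}",
            }
        )
    return examples
```

Both functions produce prompt/target pairs suitable for standard supervised fine-tuning, which is the sense in which early experience requires no reward signal: supervision comes from observed future states and generated rationales rather than from verifiable rewards.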