

Agent Learning via Early Experience

October 9, 2025
作者: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
cs.AI

Abstract

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
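The two strategies described in the abstract can be made concrete with a small illustration. Below is a minimal Python sketch of how rollouts collected under the early-experience paradigm might be turned into supervised training examples for each strategy. The `Transition` fields and the `expert_action` and `reflect` helpers are assumptions made for illustration, not the paper's actual data format or interface.

```python
# A minimal sketch of constructing "early experience" training data from agent
# rollouts, assuming a simple textual state/action format. All names below are
# illustrative placeholders, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transition:
    state: str          # textual observation before the agent acts
    action: str         # action proposed by the current policy
    next_state: str     # observation returned by the environment


def world_modeling_examples(transitions: List[Transition]) -> List[dict]:
    """Implicit world modeling: train the model to predict the future state
    that follows its own action, grounding the policy in environment dynamics."""
    return [
        {
            "prompt": f"State:\n{t.state}\n\nAction:\n{t.action}\n\nPredict the next state:",
            "target": t.next_state,
        }
        for t in transitions
    ]


def self_reflection_examples(
    transitions: List[Transition],
    expert_action: Callable[[str], str],
    reflect: Callable[[str, str, str, str], str],
) -> List[dict]:
    """Self-reflection: contrast the agent's own (possibly suboptimal) action
    with a reference action and learn from a generated rationale.
    `expert_action` and `reflect` are hypothetical helpers, e.g. a lookup into
    expert demonstrations and an LLM prompt that explains the difference."""
    examples = []
    for t in transitions:
        reference = expert_action(t.state)
        rationale = reflect(t.state, t.action, t.next_state, reference)
        examples.append(
            {
                "prompt": f"State:\n{t.state}\n\nWhich action should be taken, and why?",
                "target": f"{rationale}\n\nAction: {reference}",
            }
        )
    return examples
```

Both functions produce prompt/target pairs suitable for standard supervised fine-tuning, which is the sense in which early experience requires no reward signal: supervision comes from observed future states and generated rationales rather than from verifiable rewards.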