Agent Learning via Early Experience
October 9, 2025
作者: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
cs.AI
Abstract
A long-term goal of language agents is to learn and improve through their own
experience, ultimately outperforming humans in complex, real-world tasks.
However, training agents from experience data with reinforcement learning
remains difficult in many environments, which either lack verifiable rewards
(e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn
tool use). As a result, most current agents rely on supervised fine-tuning on
expert data, which is challenging to scale and generalizes poorly. This
limitation stems from the nature of expert demonstrations: they capture only a
narrow range of scenarios and expose the agent to limited environment
diversity. We address this limitation with a middle-ground paradigm we call
early experience: interaction data generated by the agent's own actions, where
the resulting future states serve as supervision without reward signals. Within
this paradigm we study two strategies of using such data: (1) Implicit world
modeling, which uses collected states to ground the policy in environment
dynamics; and (2) Self-reflection, where the agent learns from its suboptimal
actions to improve reasoning and decision-making. We evaluate across eight
diverse environments and multiple model families. Our approaches consistently
improve effectiveness and out-of-domain generalization, highlighting the value
of early experience. Moreover, in environments with verifiable rewards, our
results provide promising signals that early experience offers a strong
foundation for subsequent reinforcement learning, positioning it as a practical
bridge between imitation learning and fully experience-driven agents.
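To make the paradigm concrete, below is a minimal sketch (not the authors' code) of how early-experience rollouts could be turned into supervised examples without any reward signal, covering both strategies named in the abstract. The dataclass layout, prompt templates, and function names are illustrative assumptions; the paper's actual data format may differ.

```python
# Sketch: building reward-free supervision from the agent's own rollouts.
# All names and prompt templates below are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    state: str       # textual observation before acting
    action: str      # action proposed and executed by the agent itself
    next_state: str  # observation produced by the environment afterwards


def world_modeling_examples(rollout: List[Transition]) -> List[Tuple[str, str]]:
    """Implicit world modeling: predict the resulting state from (state, action),
    grounding the policy in environment dynamics without rewards."""
    examples = []
    for t in rollout:
        prompt = (
            f"Observation:\n{t.state}\n\n"
            f"Action taken:\n{t.action}\n\n"
            "Predict the next observation:"
        )
        examples.append((prompt, t.next_state))
    return examples


def self_reflection_examples(
    rollout: List[Transition], reflections: List[str]
) -> List[Tuple[str, str]]:
    """Self-reflection: pair each action's observed outcome with a textual
    reflection on the decision. Here the reflections are assumed to be
    generated by the agent itself after seeing the resulting state."""
    examples = []
    for t, reflection in zip(rollout, reflections):
        prompt = (
            f"Observation:\n{t.state}\n\n"
            f"You took the action:\n{t.action}\n\n"
            f"It led to:\n{t.next_state}\n\n"
            "Reflect on whether this was a good decision and why:"
        )
        examples.append((prompt, reflection))
    return examples


if __name__ == "__main__":
    demo = [
        Transition(
            state="Search results page for 'laptop'",
            action="click(filter='price<500')",
            next_state="Results filtered to laptops under $500",
        )
    ]
    for prompt, target in world_modeling_examples(demo):
        print(prompt, "\n->", target)
```

Both functions emit plain (prompt, target) pairs suitable for standard supervised fine-tuning, which is what lets early experience slot in between imitation learning and later reinforcement learning.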