Agent Learning via Early Experience
October 9, 2025
作者: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu
cs.AI
Abstract
A long-term goal of language agents is to learn and improve through their own
experience, ultimately outperforming humans in complex, real-world tasks.
However, training agents from experience data with reinforcement learning
remains difficult in many environments, which either lack verifiable rewards
(e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn
tool use). As a result, most current agents rely on supervised fine-tuning on
expert data, which is challenging to scale and generalizes poorly. This
limitation stems from the nature of expert demonstrations: they capture only a
narrow range of scenarios and expose the agent to limited environment
diversity. We address this limitation with a middle-ground paradigm we call
early experience: interaction data generated by the agent's own actions, where
the resulting future states serve as supervision without reward signals. Within
this paradigm we study two strategies of using such data: (1) Implicit world
modeling, which uses collected states to ground the policy in environment
dynamics; and (2) Self-reflection, where the agent learns from its suboptimal
actions to improve reasoning and decision-making. We evaluate across eight
diverse environments and multiple model families. Our approaches consistently
improve effectiveness and out-of-domain generalization, highlighting the value
of early experience. Moreover, in environments with verifiable rewards, our
results provide promising signals that early experience offers a strong
foundation for subsequent reinforcement learning, positioning it as a practical
bridge between imitation learning and fully experience-driven agents.
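To make the paradigm concrete, below is a minimal sketch (not the authors' code) of how early-experience rollouts could be turned into supervised examples without any reward signal, covering both strategies named in the abstract. The dataclass layout, prompt templates, and function names are illustrative assumptions; the paper's actual data format may differ.

```python
# Sketch: building reward-free supervision from the agent's own rollouts.
# All names and prompt templates below are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    state: str       # textual observation before acting
    action: str      # action proposed and executed by the agent itself
    next_state: str  # observation produced by the environment afterwards


def world_modeling_examples(rollout: List[Transition]) -> List[Tuple[str, str]]:
    """Implicit world modeling: predict the resulting state from (state, action),
    grounding the policy in environment dynamics without rewards."""
    examples = []
    for t in rollout:
        prompt = (
            f"Observation:\n{t.state}\n\n"
            f"Action taken:\n{t.action}\n\n"
            "Predict the next observation:"
        )
        examples.append((prompt, t.next_state))
    return examples


def self_reflection_examples(
    rollout: List[Transition], reflections: List[str]
) -> List[Tuple[str, str]]:
    """Self-reflection: pair each action's observed outcome with a textual
    reflection on the decision. Here the reflections are assumed to be
    generated by the agent itself after seeing the resulting state."""
    examples = []
    for t, reflection in zip(rollout, reflections):
        prompt = (
            f"Observation:\n{t.state}\n\n"
            f"You took the action:\n{t.action}\n\n"
            f"It led to:\n{t.next_state}\n\n"
            "Reflect on whether this was a good decision and why:"
        )
        examples.append((prompt, reflection))
    return examples


if __name__ == "__main__":
    demo = [
        Transition(
            state="Search results page for 'laptop'",
            action="click(filter='price<500')",
            next_state="Results filtered to laptops under $500",
        )
    ]
    for prompt, target in world_modeling_examples(demo):
        print(prompt, "\n->", target)
```

Both functions emit plain (prompt, target) pairs suitable for standard supervised fine-tuning, which is what lets early experience slot in between imitation learning and later reinforcement learning.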