ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
October 14, 2025
Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
cs.AI
Abstract
Recent advances in embodied AI highlight the potential of vision language
models (VLMs) as agents capable of perception, reasoning, and interaction in
complex environments. However, top-performing systems rely on large-scale
models that are costly to deploy, while smaller VLMs lack the necessary
knowledge and skills to succeed. To bridge this gap, we present
Embodied Reasoning Agent (ERA), a two-stage framework that integrates
prior knowledge learning and online reinforcement learning (RL). The first
stage, Embodied Prior Learning, distills foundational knowledge from
three types of data: (1) Trajectory-Augmented Priors, which enrich existing
trajectory data with structured reasoning generated by stronger models; (2)
Environment-Anchored Priors, which provide in-environment knowledge and
grounding supervision; and (3) External Knowledge Priors, which transfer
general knowledge from out-of-environment datasets. In the second stage, we
develop an online RL pipeline that builds on these priors to further enhance
agent performance. To overcome the inherent challenges in agent RL, including
long horizons, sparse rewards, and training instability, we introduce three key
designs: self-summarization for context management, dense reward shaping, and
turn-level policy optimization. Extensive experiments on both high-level
planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate
that ERA-3B surpasses both prompting-based large models and previous
training-based baselines. Specifically, it achieves overall improvements of
8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits
strong generalization to unseen tasks. Overall, ERA offers a practical path
toward scalable embodied intelligence, providing methodological insights for
future embodied AI systems.
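
The abstract describes self-summarization as a way to manage context over long horizons. Below is a minimal, hypothetical sketch of that idea: rather than feeding the full interaction history to the VLM at every turn, the agent keeps a bounded running summary that is updated after each action. The function names (`policy`, `summarize`, `env_step`) and the episode loop are illustrative assumptions, not ERA's actual interface.

```python
from typing import Callable, List, Tuple

def run_episode(
    policy: Callable[[str, str], str],                 # (summary, observation) -> action
    summarize: Callable[[str, str, str], str],         # (summary, action, feedback) -> new summary
    env_step: Callable[[str], Tuple[str, str, bool]],  # action -> (observation, feedback, done)
    first_obs: str,
    max_turns: int = 30,
) -> List[str]:
    """Roll out one episode while compressing history into a running summary."""
    summary, obs = "", first_obs
    actions: List[str] = []
    for _ in range(max_turns):
        action = policy(summary, obs)          # condition on the summary, not the full history
        obs, feedback, done = env_step(action)
        summary = summarize(summary, action, feedback)  # self-summarization step
        actions.append(action)
        if done:
            break
    return actions
```

Keeping the prompt length bounded this way is what makes long-horizon RL rollouts tractable for a small VLM, since the context no longer grows linearly with the number of turns.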
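The other two RL designs, dense reward shaping and turn-level policy optimization, can be illustrated together. The sketch below is an assumption-laden stand-in: it uses a REINFORCE-style surrogate with a per-turn advantage (discounted return minus a baseline), and a hypothetical `shaping` bonus (e.g., for a completed subgoal) added to the sparse success reward. The abstract does not specify ERA's actual objective, discount, or shaping terms.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    log_prob: float  # summed log-probability of the action tokens in this turn
    success: float   # sparse terminal reward: 1.0 on the final successful turn, else 0.0
    shaping: float   # hypothetical dense bonus, e.g. 0.1 for a completed subgoal

def turn_returns(turns: List[Turn], gamma: float = 0.99) -> List[float]:
    """Discounted return per turn, mixing the sparse reward with dense shaping."""
    g, out = 0.0, []
    for t in reversed(turns):
        g = (t.success + t.shaping) + gamma * g
        out.append(g)
    return out[::-1]

def turn_level_loss(turns: List[Turn], baseline: float, gamma: float = 0.99) -> float:
    """REINFORCE-style surrogate with a turn-level advantage (return - baseline)."""
    returns = turn_returns(turns, gamma)
    return -sum(t.log_prob * (g - baseline) for t, g in zip(turns, returns)) / len(turns)
```

Assigning credit at the turn level rather than over the whole trajectory is what the shaping terms make possible: each turn gets its own advantage, which mitigates the sparse-reward and instability issues the abstract attributes to long-horizon agent RL.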