ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
October 14, 2025
Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
cs.AI
Abstract
Recent advances in embodied AI highlight the potential of vision-language
models (VLMs) as agents capable of perception, reasoning, and interaction in
complex environments. However, top-performing systems rely on large-scale
models that are costly to deploy, while smaller VLMs lack the necessary
knowledge and skills to succeed. To bridge this gap, we present the
Embodied Reasoning Agent (ERA), a two-stage framework that integrates
prior knowledge learning and online reinforcement learning (RL). The first
stage, Embodied Prior Learning, distills foundational knowledge from
three types of data: (1) Trajectory-Augmented Priors, which enrich existing
trajectory data with structured reasoning generated by stronger models; (2)
Environment-Anchored Priors, which provide in-environment knowledge and
grounding supervision; and (3) External Knowledge Priors, which transfer
general knowledge from out-of-environment datasets. In the second stage, we
develop an online RL pipeline that builds on these priors to further enhance
agent performance. To overcome the inherent challenges in agent RL, including
long horizons, sparse rewards, and training instability, we introduce three key
designs: self-summarization for context management, dense reward shaping, and
turn-level policy optimization. Extensive experiments on both high-level
planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate
that ERA-3B surpasses both prompting-based large models and previous
training-based baselines. Specifically, it achieves overall improvements of
8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits
strong generalization to unseen tasks. Overall, ERA offers a practical path
toward scalable embodied intelligence, providing methodological insights for
future embodied AI systems.
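
To make two of the RL design choices named above more concrete, the following is a minimal Python sketch of dense reward shaping combined with turn-level advantage estimation. The abstract does not specify the reward terms or the estimator, so everything here is an assumption for illustration: the function names `shape_rewards` and `turn_level_gae`, the bonus values, and the use of generalized advantage estimation (GAE) computed once per agent turn rather than per token. It is not the authors' implementation.

```python
from typing import List


def shape_rewards(
    subgoal_done: List[bool],
    success: bool,
    subgoal_bonus: float = 0.1,   # hypothetical value, not from the paper
    success_bonus: float = 1.0,   # hypothetical value, not from the paper
) -> List[float]:
    """Turn a sparse success signal into one reward per turn.

    Each turn that completes a subgoal earns a small bonus; the final
    turn additionally earns the task-success bonus.
    """
    rewards = [subgoal_bonus if done else 0.0 for done in subgoal_done]
    if rewards and success:
        rewards[-1] += success_bonus
    return rewards


def turn_level_gae(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute one GAE advantage per turn.

    `values` holds a value estimate for the state before each turn plus
    one bootstrap value for the state after the last turn, so it must
    have len(rewards) + 1 entries. In a turn-level scheme, every token
    generated within a turn would share that turn's advantage.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at the turn boundary.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Three-turn episode: subgoals completed on turns 1 and 3, task succeeds.
    r = shape_rewards([True, False, True], success=True)
    adv = turn_level_gae(r, values=[0.0, 0.0, 0.0, 0.0])
    print(r, adv)
```

Treating a whole turn as the unit of credit assignment keeps the number of value predictions equal to the number of environment steps, which is one plausible way to reduce variance over long horizons; the paper itself should be consulted for the actual formulation.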