

ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning

October 14, 2025
Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
cs.AI

Abstract

Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present Embodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, Embodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
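Two of the second-stage designs named above, dense reward shaping and turn-level credit assignment, can be loosely sketched in code. The helper names, bonus values, and episode format below are hypothetical illustrations of the general technique, not ERA's actual implementation:

```python
# Illustrative sketch (not the paper's implementation): dense reward
# shaping plus per-turn discounted returns for an episodic agent.
# `subgoal_done`, `subgoal_bonus`, and `success_reward` are assumed
# stand-ins for whatever shaping signals ERA actually uses.

def shaped_rewards(turns, success, subgoal_bonus=0.1, success_reward=1.0):
    """Dense shaping: a small bonus for each turn that completes a
    subgoal, plus a terminal reward if the whole task succeeds."""
    rewards = [subgoal_bonus if t.get("subgoal_done") else 0.0 for t in turns]
    if success:
        rewards[-1] += success_reward
    return rewards

def turn_level_returns(rewards, gamma=0.99):
    """Assign each turn its discounted return, so every turn (not just
    the final one) receives a learning signal."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A toy 3-turn episode: the agent completes subgoals on turns 2 and 3
# and ultimately succeeds.
episode = [
    {"subgoal_done": False},
    {"subgoal_done": True},
    {"subgoal_done": True},
]
rewards = shaped_rewards(episode, success=True)
print(rewards)                    # per-turn shaped rewards
print(turn_level_returns(rewards))  # per-turn discounted returns
```

In a full pipeline these per-turn returns (or advantages derived from them) would weight the policy-gradient update for each turn's action, rather than spreading a single sparse episode-level reward across the whole trajectory.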