
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

August 28, 2025
Authors: Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
cs.AI

Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fall short of human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 rests on two key pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) indiscriminately, and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained on EO-Data1.5M through the synergy of auto-regressive decoding and flow-matching denoising, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
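
To make the training recipe concrete, the sketch below illustrates, under stated assumptions rather than the authors' implementation, how an auto-regressive next-token loss on text can be combined with a rectified-flow flow-matching loss on continuous action chunks in a single objective. The module names (`backbone`, `lm_head`, `action_head`), the tensor shapes, and the loss weight `lambda_fm` are hypothetical placeholders.

```python
# Minimal sketch of joint auto-regressive + flow-matching training.
# `backbone`, `lm_head`, and `action_head` are assumed modules, not EO-1's actual code.

import torch
import torch.nn.functional as F

def combined_loss(backbone, lm_head, action_head,
                  input_embeds, text_labels, target_actions, lambda_fm=1.0):
    """input_embeds:   (B, L, D) fused image/text/state embeddings
       text_labels:    (B, L)    token ids, -100 at positions without text supervision
       target_actions: (B, H, A) ground-truth action chunk (horizon H, action dim A)"""
    hidden = backbone(inputs_embeds=input_embeds).last_hidden_state   # (B, L, D)

    # 1) Auto-regressive decoding: next-token cross-entropy on text positions.
    logits = lm_head(hidden)                                          # (B, L, V)
    ce = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                         text_labels[:, 1:].flatten(), ignore_index=-100)

    # 2) Flow-matching denoising: noise the action chunk and regress the velocity field.
    b = target_actions.size(0)
    t = torch.rand(b, 1, 1, device=target_actions.device)             # flow time in (0, 1)
    noise = torch.randn_like(target_actions)
    x_t = (1.0 - t) * noise + t * target_actions                      # rectified-flow interpolant
    v_target = target_actions - noise                                 # target velocity
    v_pred = action_head(x_t, t.view(b), hidden)                      # hypothetical denoising head
    fm = F.mse_loss(v_pred, v_target)

    return ce + lambda_fm * fm
```

Text tokens receive discrete next-token supervision while actions remain continuous, so the two losses can simply be summed and backpropagated through the shared backbone in one step, which is one plausible way to realize the "synergy" the abstract describes.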