
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

August 28, 2025
Authors: Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
cs.AI

Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fall short of human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 rests on two key pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) indiscriminately, and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained on EO-Data1.5M through the synergy of auto-regressive decoding and flow-matching denoising, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
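
To make the training recipe concrete, the sketch below illustrates, under stated assumptions rather than the authors' implementation, how an auto-regressive next-token loss on text can be combined with a rectified-flow flow-matching loss on continuous action chunks in a single objective. The module names (`backbone`, `lm_head`, `action_head`), the tensor shapes, and the loss weight `lambda_fm` are hypothetical placeholders.

```python
# Minimal sketch of joint auto-regressive + flow-matching training.
# `backbone`, `lm_head`, and `action_head` are assumed modules, not EO-1's actual code.

import torch
import torch.nn.functional as F

def combined_loss(backbone, lm_head, action_head,
                  input_embeds, text_labels, target_actions, lambda_fm=1.0):
    """input_embeds:   (B, L, D) fused image/text/state embeddings
       text_labels:    (B, L)    token ids, -100 at positions without text supervision
       target_actions: (B, H, A) ground-truth action chunk (horizon H, action dim A)"""
    hidden = backbone(inputs_embeds=input_embeds).last_hidden_state   # (B, L, D)

    # 1) Auto-regressive decoding: next-token cross-entropy on text positions.
    logits = lm_head(hidden)                                          # (B, L, V)
    ce = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                         text_labels[:, 1:].flatten(), ignore_index=-100)

    # 2) Flow-matching denoising: noise the action chunk and regress the velocity field.
    b = target_actions.size(0)
    t = torch.rand(b, 1, 1, device=target_actions.device)             # flow time in (0, 1)
    noise = torch.randn_like(target_actions)
    x_t = (1.0 - t) * noise + t * target_actions                      # rectified-flow interpolant
    v_target = target_actions - noise                                 # target velocity
    v_pred = action_head(x_t, t.view(b), hidden)                      # hypothetical denoising head
    fm = F.mse_loss(v_pred, v_target)

    return ce + lambda_fm * fm
```

Text tokens receive discrete next-token supervision while actions remain continuous, so the two losses can simply be summed and backpropagated through the shared backbone in one step, which is one plausible way to realize the "synergy" the abstract describes.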