EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
August 28, 2025
作者: Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
cs.AI
Abstract
The human ability to seamlessly perform multimodal reasoning and physical
interaction in the open world is a core goal for general-purpose embodied
intelligent systems. Recent vision-language-action (VLA) models, which are
co-trained on large-scale robot and visual-text data, have demonstrated notable
progress in general robot control. However, they still fail to achieve
human-level flexibility in interleaved reasoning and interaction. In this work,
we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is
a unified embodied foundation model that achieves superior performance in
multimodal embodied reasoning and robot control through interleaved
vision-text-action pre-training. The development of EO-1 is based on two key
pillars: (i) a unified architecture that processes multimodal inputs
indiscriminately (image, text, video, and action), and (ii) a massive,
high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains
over 1.5 million samples with emphasis on interleaved vision-text-action
comprehension. EO-1 is trained through synergies between auto-regressive
decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot
action generation and multimodal embodied reasoning. Extensive experiments
demonstrate the effectiveness of interleaved vision-text-action learning for
open-world understanding and generalization, validated through a variety of
long-horizon, dexterous manipulation tasks across multiple embodiments. This
paper details the architecture of EO-1, the data construction strategy of
EO-Data1.5M, and the training methodology, offering valuable insights for
developing advanced embodied foundation models.
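
To make the combined training signal described above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how auto-regressive next-token decoding on text and flow-matching denoising on continuous action chunks could share one transformer backbone. All module names, shapes, and hyperparameters here (e.g. `InterleavedVLASketch`, `action_dim=7`, `horizon=16`) are assumptions for illustration only and do not reflect EO-1's actual architecture or code.

```python
# Sketch of the two losses the abstract describes: auto-regressive decoding for
# text and flow-matching denoising for action chunks, on a shared backbone.
# Everything below is an illustrative assumption, not the EO-1 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterleavedVLASketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, action_dim=7, horizon=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.action_in = nn.Linear(action_dim, d_model)    # embeds noisy action chunk
        self.time_emb = nn.Linear(1, d_model)               # embeds flow-matching time t
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)        # auto-regressive text head
        self.flow_head = nn.Linear(d_model, action_dim)      # velocity prediction head

    def forward(self, text_tokens, noisy_actions, t):
        # text_tokens: (B, L) int64, noisy_actions: (B, H, action_dim), t: (B, 1)
        txt = self.token_emb(text_tokens)
        act = self.action_in(noisy_actions) + self.time_emb(t).unsqueeze(1)
        h = self.backbone(torch.cat([txt, act], dim=1))       # one interleaved sequence
        txt_h, act_h = h[:, : text_tokens.size(1)], h[:, text_tokens.size(1):]
        return self.lm_head(txt_h), self.flow_head(act_h)


def training_step(model, text_tokens, actions):
    """One combined step: next-token loss on text + flow-matching loss on actions."""
    B = actions.size(0)
    t = torch.rand(B, 1)                                      # flow-matching time in [0, 1]
    noise = torch.randn_like(actions)
    # Linear interpolation path x_t = (1 - t) * noise + t * actions,
    # whose target velocity is (actions - noise).
    x_t = (1 - t)[:, :, None] * noise + t[:, :, None] * actions
    logits, velocity = model(text_tokens, x_t, t)
    ar_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),          # predict the next token
        text_tokens[:, 1:].reshape(-1),
    )
    fm_loss = F.mse_loss(velocity, actions - noise)           # flow-matching regression
    return ar_loss + fm_loss


if __name__ == "__main__":
    model = InterleavedVLASketch()
    text = torch.randint(0, 32000, (2, 24))                   # dummy instruction tokens
    acts = torch.randn(2, 16, 7)                              # dummy action chunk
    print(training_step(model, text, acts).item())
```

In this toy setup, discrete text tokens are supervised with standard cross-entropy while the continuous action chunk is supervised by regressing the flow-matching velocity field; summing the two losses mirrors, at a schematic level, the synergy between auto-regressive decoding and flow-matching denoising that the abstract attributes to EO-1's training.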