EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
August 28, 2025
作者: Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang
cs.AI
Abstract
The human ability to seamlessly perform multimodal reasoning and physical
interaction in the open world is a core goal for general-purpose embodied
intelligent systems. Recent vision-language-action (VLA) models, which are
co-trained on large-scale robot and visual-text data, have demonstrated notable
progress in general robot control. However, they still fail to achieve
human-level flexibility in interleaved reasoning and interaction. In this work,
we introduce EO-Robotics, consisting of the EO-1 model and the EO-Data1.5M dataset. EO-1 is
a unified embodied foundation model that achieves superior performance in
multimodal embodied reasoning and robot control through interleaved
vision-text-action pre-training. The development of EO-1 is based on two key
pillars: (i) a unified architecture that processes multimodal inputs
indiscriminately (image, text, video, and action), and (ii) a massive,
high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains
over 1.5 million samples with emphasis on interleaved vision-text-action
comprehension. EO-1 is trained through synergies between auto-regressive
decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot
action generation and multimodal embodied reasoning. Extensive experiments
demonstrate the effectiveness of interleaved vision-text-action learning for
open-world understanding and generalization, validated through a variety of
long-horizon, dexterous manipulation tasks across multiple embodiments. This
paper details the architecture of EO-1, the data construction strategy of
EO-Data1.5M, and the training methodology, offering valuable insights for
developing advanced embodied foundation models.
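
To make the combined training signal described above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how auto-regressive next-token decoding on text and flow-matching denoising on continuous action chunks could share one transformer backbone. All module names, shapes, and hyperparameters here (e.g. `InterleavedVLASketch`, `action_dim=7`, `horizon=16`) are assumptions for illustration only and do not reflect EO-1's actual architecture or code.

```python
# Sketch of the two losses the abstract describes: auto-regressive decoding for
# text and flow-matching denoising for action chunks, on a shared backbone.
# Everything below is an illustrative assumption, not the EO-1 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterleavedVLASketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, action_dim=7, horizon=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.action_in = nn.Linear(action_dim, d_model)    # embeds noisy action chunk
        self.time_emb = nn.Linear(1, d_model)               # embeds flow-matching time t
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)        # auto-regressive text head
        self.flow_head = nn.Linear(d_model, action_dim)      # velocity prediction head

    def forward(self, text_tokens, noisy_actions, t):
        # text_tokens: (B, L) int64, noisy_actions: (B, H, action_dim), t: (B, 1)
        txt = self.token_emb(text_tokens)
        act = self.action_in(noisy_actions) + self.time_emb(t).unsqueeze(1)
        h = self.backbone(torch.cat([txt, act], dim=1))       # one interleaved sequence
        txt_h, act_h = h[:, : text_tokens.size(1)], h[:, text_tokens.size(1):]
        return self.lm_head(txt_h), self.flow_head(act_h)


def training_step(model, text_tokens, actions):
    """One combined step: next-token loss on text + flow-matching loss on actions."""
    B = actions.size(0)
    t = torch.rand(B, 1)                                      # flow-matching time in [0, 1]
    noise = torch.randn_like(actions)
    # Linear interpolation path x_t = (1 - t) * noise + t * actions,
    # whose target velocity is (actions - noise).
    x_t = (1 - t)[:, :, None] * noise + t[:, :, None] * actions
    logits, velocity = model(text_tokens, x_t, t)
    ar_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),          # predict the next token
        text_tokens[:, 1:].reshape(-1),
    )
    fm_loss = F.mse_loss(velocity, actions - noise)           # flow-matching regression
    return ar_loss + fm_loss


if __name__ == "__main__":
    model = InterleavedVLASketch()
    text = torch.randint(0, 32000, (2, 24))                   # dummy instruction tokens
    acts = torch.randn(2, 16, 7)                              # dummy action chunk
    print(training_step(model, text, acts).item())
```

In this toy setup, discrete text tokens are supervised with standard cross-entropy while the continuous action chunk is supervised by regressing the flow-matching velocity field; summing the two losses mirrors, at a schematic level, the synergy between auto-regressive decoding and flow-matching denoising that the abstract attributes to EO-1's training.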