EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 rests on two key pillars: (i) a unified architecture that processes multimodal inputs (image, text, video, and action) indiscriminately, and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with an emphasis on interleaved vision-text-action comprehension. EO-1 is trained on EO-Data1.5M through the synergy of auto-regressive decoding and flow matching denoising, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated on a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
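As a rough illustration of how an auto-regressive decoding objective and a flow matching denoising objective can be combined in one training loss, the sketch below may help. It is not the EO-1 implementation: the abstract does not give the loss formulation, so the linear noise-to-action path, the function names, and the `action_weight` parameter are all assumptions made for this example.

```python
import math
import random


def next_token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target token at each position.

    logits: list of per-position score lists; targets: list of token indices.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)


def flow_matching_loss(velocity_fn, actions, rng):
    """Mean squared error between predicted and target velocities.

    Assumption: a linear path x_t = (1 - t) * noise + t * action, whose
    target velocity is (action - noise). The paper may use another path.
    """
    total, count = 0.0, 0
    for action in actions:
        t = rng.random()  # one time value in [0, 1) per action vector
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
        predicted = velocity_fn(x_t, t)  # hypothetical action head
        for p, n, a in zip(predicted, noise, action):
            total += (p - (a - n)) ** 2
            count += 1
    return total / count


def joint_loss(logits, targets, velocity_fn, actions, rng, action_weight=1.0):
    """Weighted sum of the text loss and the action loss (weight is assumed)."""
    text_loss = next_token_cross_entropy(logits, targets)
    action_loss = flow_matching_loss(velocity_fn, actions, rng)
    return text_loss + action_weight * action_loss
```

In this sketch the text tokens are supervised with discrete next-token prediction while the continuous actions are supervised by regressing a velocity field, so both modalities contribute to a single scalar objective.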
