OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

Abstract

Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. The first is the Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. The second is the Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits strong ability across a wide range of downstream scenarios. Evaluations on a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
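To make the idea of a gated router concrete, the following is a minimal illustrative sketch, not the OmniEVA implementation. The abstract does not specify the architecture, so everything here is an assumption: the function names (`gate_value`, `gated_fuse`), the scalar sigmoid gate, the linear scoring of a task embedding, and the additive fusion are all hypothetical choices made only to show how a context-dependent gate could scale the contribution of 3D features.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_value(task_emb: List[float], weights: List[float], bias: float) -> float:
    """Hypothetical router: map a task/context embedding to a gate in (0, 1).

    A linear score followed by a sigmoid is an assumption for illustration;
    the source does not state how the gate is computed.
    """
    score = sum(w * t for w, t in zip(weights, task_emb)) + bias
    return sigmoid(score)


def gated_fuse(visual: List[float], geometry: List[float], gate: float) -> List[float]:
    """Add gate-scaled 3D (geometry) features to the 2D visual features.

    gate close to 0 leaves the 2D features nearly unchanged;
    gate close to 1 injects the 3D features at full strength.
    """
    return [v + gate * g for v, g in zip(visual, geometry)]


# Usage: a context that the (hypothetical) router scores low keeps 2D features.
visual_feat = [1.0, 2.0]
geometry_feat = [2.0, 4.0]
g = gate_value(task_emb=[1.0, 0.0], weights=[-20.0, 0.0], bias=0.0)
fused = gated_fuse(visual_feat, geometry_feat, g)
```

In this sketch a single scalar gate is shared across all feature dimensions; a per-token or per-channel gate would be an equally plausible reading of "selective regulation of 3D fusion".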
