OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
September 11, 2025
Authors: Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
cs.AI
Abstract
Recent advances in multimodal large language models (MLLMs) have opened new
opportunities for embodied intelligence, enabling multimodal understanding,
reasoning, and interaction, as well as continuous spatial decision-making.
Nevertheless, current MLLM-based embodied systems face two critical
limitations. First, the Geometric Adaptability Gap: models trained solely on 2D
inputs or with hard-coded 3D geometry injection suffer from either insufficient
spatial information or restricted 2D generalization, leading to poor
adaptability across tasks with diverse spatial demands. Second, the Embodiment
Constraint Gap: prior work often neglects the physical constraints and
capacities of real robots, resulting in task plans that are theoretically valid
but practically infeasible. To address these gaps, we introduce OmniEVA -- an
embodied versatile planner that enables advanced embodied reasoning and task
planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding
mechanism, which introduces a gated router to perform explicit selective
regulation of 3D fusion based on contextual requirements, enabling
context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware
Reasoning framework that jointly incorporates task goals and embodiment
constraints into the reasoning loop, resulting in planning decisions that are
both goal-directed and executable. Extensive experimental results demonstrate
that OmniEVA not only achieves state-of-the-art general embodied reasoning
performance, but also exhibits strong capabilities across a wide range of
downstream scenarios. Evaluations on a suite of proposed embodied benchmarks,
including both primitive and composite tasks, confirm its robust and versatile
planning capabilities. Project page: https://omnieva.github.io
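
The abstract does not give implementation details, but the Task-Adaptive 3D Grounding idea can be pictured as a learned gate that decides, per task context, how much 3D geometry to fuse into the 2D visual tokens. The following minimal sketch is not the authors' code; all module and tensor names (GatedRouter, feats_2d, feats_3d, task_ctx) are assumptions for illustration.

```python
# Illustrative sketch only; names and architecture are assumptions,
# not the OmniEVA implementation.
import torch
import torch.nn as nn

class GatedRouter(nn.Module):
    """Predicts a scalar gate from the task context and uses it to
    selectively fuse 3D geometric features into 2D visual tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1)
        )
        self.proj_3d = nn.Linear(dim, dim)  # align 3D features to the 2D token space

    def forward(self, feats_2d, feats_3d, task_ctx):
        # g in (0, 1): how much 3D geometry this particular task needs.
        g = torch.sigmoid(self.gate(task_ctx))                     # (B, 1)
        return feats_2d + g.unsqueeze(1) * self.proj_3d(feats_3d)  # (B, N, D)

router = GatedRouter(dim=256)
fused = router(torch.randn(2, 16, 256),   # 2D visual tokens
               torch.randn(2, 16, 256),   # 3D geometric features
               torch.randn(2, 256))       # task-context embedding
print(fused.shape)  # torch.Size([2, 16, 256])
```

Under this reading, a spatially demanding query (e.g., navigation) should drive the gate toward 1, while a purely semantic query can keep it near 0, preserving the model's 2D generalization.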
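
Similarly, the Embodiment-Aware Reasoning framework can be pictured as a feasibility check inside the planning loop, so that goal-directed but physically inexecutable steps are rejected. The sketch below uses hypothetical Embodiment and Action types and made-up constraint rules purely to illustrate the idea.

```python
# Hypothetical types and constraint rules, for illustration only.
from dataclasses import dataclass

@dataclass
class Embodiment:
    max_reach_m: float     # arm reach
    max_payload_kg: float  # lifting capacity

@dataclass
class Action:
    name: str
    target_dist_m: float   # distance to the manipulation target
    target_mass_kg: float  # mass of the object to manipulate

def feasible(a: Action, e: Embodiment) -> bool:
    """An action survives only if it respects the robot's physical limits."""
    return a.target_dist_m <= e.max_reach_m and a.target_mass_kg <= e.max_payload_kg

def plan(candidates: list[Action], e: Embodiment) -> list[Action]:
    # Keep goal-directed steps that are also executable on this embodiment.
    return [a for a in candidates if feasible(a, e)]

robot = Embodiment(max_reach_m=0.8, max_payload_kg=3.0)
steps = [Action("pick up the cup", 0.5, 0.3),
         Action("lift the sofa", 0.6, 40.0)]
print([a.name for a in plan(steps, robot)])  # ['pick up the cup']
```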