RoboOmni: Proactive Robot Manipulation in Omni-modal Context
October 27, 2025
Authors: Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yugang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu
cs.AI
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid
progress in Vision-Language-Action (VLA) models for robotic manipulation.
Although effective in many scenarios, current approaches largely rely on
explicit instructions, whereas in real-world interactions, humans rarely issue
instructions directly. Effective collaboration requires robots to infer user
intentions proactively. In this work, we introduce cross-modal contextual
instructions, a new setting where intent is derived from spoken dialogue,
environmental sounds, and visual cues rather than explicit commands. To address
this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor
framework based on end-to-end omni-modal LLMs that unifies intention
recognition, interaction confirmation, and action execution. RoboOmni fuses
auditory and visual signals spatiotemporally for robust intention recognition,
while supporting direct speech interaction. To address the absence of training
data for proactive intention recognition in robotic manipulation, we build
OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640
backgrounds, and six contextual instruction types. Experiments in simulation
and real-world settings show that RoboOmni surpasses text- and ASR-based
baselines in success rate, inference speed, intention recognition, and
proactive assistance.
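
The abstract describes a control flow in which the robot perceives audio-visual context, infers an intent, confirms it with the user by speech, and only then acts. The following is a minimal sketch of that flow, assuming a simple confirm-then-act policy; every name in it (RoboOmniAgent, Intent, the 0.9 threshold) is a hypothetical illustration rather than the paper's actual interface, since RoboOmni realizes these stages end-to-end inside a single omni-modal LLM rather than as separate modules.

```python
# Illustrative sketch only: class/method names and the confidence threshold
# are assumptions, not RoboOmni's real API.
from dataclasses import dataclass


@dataclass
class Intent:
    task: str          # inferred manipulation goal, e.g. "hand over the cup"
    confidence: float  # model's belief that the inference is correct


class RoboOmniAgent:
    def perceive_and_think(self, audio: bytes, frame: bytes) -> Intent:
        """Fuse spoken dialogue, environmental sound, and vision into a candidate intent."""
        # Placeholder: the real system would query the omni-modal LLM here.
        return Intent(task="bring the water cup", confidence=0.94)

    def talk(self, intent: Intent) -> bool:
        """Proactively confirm the inferred intent through speech before acting."""
        print(f"Robot: Should I {intent.task}?")
        return True  # placeholder for the user's spoken yes/no reply

    def execute(self, intent: Intent) -> None:
        """Emit low-level manipulation actions for the confirmed task."""
        print(f"Robot: executing '{intent.task}'")

    def step(self, audio: bytes, frame: bytes) -> None:
        intent = self.perceive_and_think(audio, frame)
        # Act only when the intent is confidently inferred *and* user-confirmed.
        if intent.confidence >= 0.9 and self.talk(intent):
            self.execute(intent)


RoboOmniAgent().step(audio=b"", frame=b"")
```

The separation into perceive/talk/execute steps here is purely expository; the point of the paper's setting is that intent must be recovered from context (dialogue, ambient sound, vision) rather than from an explicit command, and confirmed with the user before execution.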