RoboOmni: Proactive Robot Manipulation in Omni-modal Context
October 27, 2025
作者: Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yugang Jiang, See-Kiong Ng, Tat-Seng Chua, Xipeng Qiu
cs.AI
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid
progress in Vision-Language-Action (VLA) models for robotic manipulation.
Although effective in many scenarios, current approaches largely rely on
explicit instructions, whereas in real-world interactions, humans rarely issue
instructions directly. Effective collaboration requires robots to infer user
intentions proactively. In this work, we introduce cross-modal contextual
instructions, a new setting where intent is derived from spoken dialogue,
environmental sounds, and visual cues rather than explicit commands. To address
this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor
framework based on end-to-end omni-modal LLMs that unifies intention
recognition, interaction confirmation, and action execution. RoboOmni fuses
auditory and visual signals spatiotemporally for robust intention recognition,
while supporting direct speech interaction. To address the absence of training
data for proactive intention recognition in robotic manipulation, we build
OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640
backgrounds, and six contextual instruction types. Experiments in simulation
and real-world settings show that RoboOmni surpasses text- and ASR-based
baselines in success rate, inference speed, intention recognition, and
proactive assistance.
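
For readers who want a concrete mental model of the Perceiver-Thinker-Talker-Executor pipeline the abstract describes, here is a minimal Python sketch of how such a loop could be organized. All class, method, and signal names below are hypothetical illustrations inferred from the abstract alone; they are not the released RoboOmni API, and the stub logic stands in for the actual omni-modal LLM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Percept:
    """Fused omni-modal context (illustrative stand-in for spatiotemporal fusion)."""
    audio_tokens: List[str]   # speech and environmental sounds
    vision_tokens: List[str]  # visual scene cues


class OmniAgent:
    """Hypothetical Perceiver-Thinker-Talker-Executor loop (not RoboOmni's code)."""

    def perceive(self, audio: List[str], video: List[str]) -> Percept:
        # Perceiver: stand-in for fusing auditory and visual streams.
        return Percept(audio_tokens=audio, vision_tokens=video)

    def think(self, p: Percept) -> str:
        # Thinker: stand-in for inferring latent user intent from
        # contextual cues rather than an explicit command.
        if "doorbell" in p.audio_tokens:
            return "answer the door"
        return "await further context"

    def talk(self, intention: str) -> bool:
        # Talker: verbalize the inferred intent and seek spoken confirmation.
        print(f"Robot: It sounds like I should {intention}. Shall I?")
        return True  # stand-in for the user's spoken reply

    def execute(self, intention: str) -> str:
        # Executor: stand-in for low-level manipulation actions.
        return f"executing: {intention}"

    def step(self, audio: List[str], video: List[str]) -> Optional[str]:
        percept = self.perceive(audio, video)
        intention = self.think(percept)
        if self.talk(intention):  # act only after the user confirms
            return self.execute(intention)
        return None


agent = OmniAgent()
print(agent.step(audio=["doorbell"], video=["hallway", "closed door"]))
```

The key design point the sketch mirrors is the confirmation gate between intent inference and execution: the agent proposes an action derived from contextual signals and acts only after interactive confirmation, which is how the abstract distinguishes proactive assistance from simply following explicit instructions.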