Robix: A Unified Model for Robot Interaction, Reasoning and Planning
September 1, 2025
Authors: Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
cs.AI
Abstract
We introduce Robix, a unified model that integrates robot reasoning, task
planning, and natural language interaction within a single vision-language
architecture. Acting as the high-level cognitive layer in a hierarchical robot
system, Robix dynamically generates atomic commands for the low-level
controller and verbal responses for human interaction, enabling robots to
follow complex instructions, plan long-horizon tasks, and interact naturally
with humans within an end-to-end framework. Robix further introduces novel
capabilities such as proactive dialogue, real-time interruption handling, and
context-aware commonsense reasoning during task execution. At its core, Robix
leverages chain-of-thought reasoning and adopts a three-stage training
strategy: (1) continued pretraining to enhance foundational embodied reasoning
abilities including 3D spatial understanding, visual grounding, and
task-centric reasoning; (2) supervised finetuning to model human-robot
interaction and task planning as a unified reasoning-action sequence; and (3)
reinforcement learning to improve reasoning-action consistency and long-horizon
task coherence. Extensive experiments demonstrate that Robix outperforms both
open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in
interactive task execution, showing strong generalization across diverse
instruction types (e.g., open-ended, multi-stage, constrained, invalid, and
interrupted) and various user-involved tasks such as table bussing, grocery
shopping, and dietary filtering.
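
The hierarchical setup described above, where a high-level layer emits a reasoning trace together with an atomic command for the low-level controller and/or a verbal response for the user, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `run_episode`, and `toy_policy` are hypothetical, observations are plain strings rather than images, and the rule-based `toy_policy` merely stands in for the vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One reasoning-action step (hypothetical structure, for illustration only)."""
    thought: str                    # chain-of-thought style reasoning text
    command: Optional[str] = None   # atomic command for the low-level controller
    response: Optional[str] = None  # verbal response for the human


# A policy maps (instruction, observation, history) to the next step.
Policy = Callable[[str, str, List[Step]], Step]


def run_episode(policy: Policy,
                execute: Callable[[str], None],
                observations: List[str],
                instruction: str) -> List[Step]:
    """Query the high-level policy once per observation and forward any
    atomic command to the low-level controller callback `execute`."""
    history: List[Step] = []
    for obs in observations:
        step = policy(instruction, obs, history)
        if step.command is not None:
            execute(step.command)
        history.append(step)
    return history


def toy_policy(instruction: str, observation: str, history: List[Step]) -> Step:
    """Rule-based stand-in for the model (an assumption, not Robix itself)."""
    if "cup" in observation:
        return Step(thought="A cup is still on the table.",
                    command="pick up the cup")
    return Step(thought="The table is clear.",
                response="The table is cleared.")


# Usage: the list's append method plays the role of the low-level controller.
executed: List[str] = []
history = run_episode(toy_policy, executed.append,
                      ["cup on table", "empty table"], "clear the table")
```

In this sketch each step may carry a command, a response, both, or neither, which loosely mirrors the abstract's description of interaction and planning as a single reasoning-action sequence.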