Robix: A Unified Model for Robot Interaction, Reasoning and Planning

September 1, 2025
Authors: Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
cs.AI

Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments show that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution and generalizes well across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
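
As a rough illustration of the hierarchical split described in the abstract (a high-level layer that reasons, then emits an atomic command for the low-level controller and, where needed, a verbal response for the human), here is a minimal sketch. Every name in it (`StepOutput`, `dispatch`, the field names, and the example commands) is hypothetical and chosen for illustration only; it is not Robix's actual output format or interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepOutput:
    """One hypothetical reasoning-action step from the high-level layer."""
    thought: str                    # chain-of-thought text, never executed
    action: Optional[str] = None    # atomic command for the low-level controller
    response: Optional[str] = None  # verbal reply for the human, if any


def dispatch(step: StepOutput,
             controller: Callable[[str], None],
             speaker: Callable[[str], None]) -> None:
    """Route a step's outputs: commands to the controller, speech to the user."""
    if step.action is not None:
        controller(step.action)
    if step.response is not None:
        speaker(step.response)


# Usage: record what each channel receives instead of driving real hardware.
sent_commands = []
spoken = []

steps = [
    # A plain planning step: reason, then act.
    StepOutput(thought="The cup is empty, so it can be cleared.",
               action="pick up the cup"),
    # A proactive-dialogue step: reason, then ask instead of acting.
    StepOutput(thought="It is unclear whether the user is done with the plate.",
               response="Are you finished with the plate?"),
]
for step in steps:
    dispatch(step, sent_commands.append, spoken.append)
```

Keeping the thought, the command, and the response in one record mirrors the idea of a unified reasoning-action sequence: a single step can act, speak, or do both, and the reasoning text travels with it without being sent to either channel.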