Robix: A Unified Model for Robot Interaction, Reasoning and Planning
September 1, 2025
Authors: Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
cs.AI
Abstract
We introduce Robix, a unified model that integrates robot reasoning, task
planning, and natural language interaction within a single vision-language
architecture. Acting as the high-level cognitive layer in a hierarchical robot
system, Robix dynamically generates atomic commands for the low-level
controller and verbal responses for human interaction, enabling robots to
follow complex instructions, plan long-horizon tasks, and interact naturally
with humans within an end-to-end framework. Robix further introduces novel
capabilities such as proactive dialogue, real-time interruption handling, and
context-aware commonsense reasoning during task execution. At its core, Robix
leverages chain-of-thought reasoning and adopts a three-stage training
strategy: (1) continued pretraining to enhance foundational embodied reasoning
abilities including 3D spatial understanding, visual grounding, and
task-centric reasoning; (2) supervised finetuning to model human-robot
interaction and task planning as a unified reasoning-action sequence; and (3)
reinforcement learning to improve reasoning-action consistency and long-horizon
task coherence. Extensive experiments demonstrate that Robix outperforms both
open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in
interactive task execution and generalizes strongly across diverse
instruction types (e.g., open-ended, multi-stage, constrained, invalid, and
interrupted) and various user-involved tasks such as table bussing, grocery
shopping, and dietary filtering.