Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments show that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution and generalizes strongly across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
