Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Abstract

We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities, including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments show that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution and generalizes strongly across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
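
As a rough illustration of the hierarchical arrangement described above, the following Python sketch shows how a high-level layer's per-step output might be routed: an atomic command goes to a low-level controller, and a verbal response goes to the user. All names here (`StepOutput`, `dispatch`, `run_demo`, and the example command strings) are hypothetical and are not taken from the Robix implementation. The model call itself is replaced by hard-coded steps.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepOutput:
    """One hypothetical reasoning-action step from the high-level layer."""
    thought: str                    # chain-of-thought text, kept internal
    action: Optional[str] = None    # atomic command for the low-level controller
    response: Optional[str] = None  # verbal reply for the human


def dispatch(step: StepOutput,
             send_to_controller: Callable[[str], None],
             say_to_user: Callable[[str], None]) -> None:
    """Route a step's outputs; the thought is not forwarded anywhere."""
    if step.action is not None:
        send_to_controller(step.action)
    if step.response is not None:
        say_to_user(step.response)


def run_demo() -> Tuple[List[str], List[str]]:
    """Feed two hard-coded steps through dispatch (stand-in for model output)."""
    commands: List[str] = []
    utterances: List[str] = []
    steps = [
        StepOutput(thought="The user asked to clear the table; a cup is visible.",
                   action="pick up the cup"),
        StepOutput(thought="The user interrupted and asked to stop.",
                   response="Okay, I will stop here."),
    ]
    for step in steps:
        dispatch(step, commands.append, utterances.append)
    return commands, utterances
```

Calling `run_demo()` collects the commands and utterances in two separate lists, which mirrors the split between the controller-facing and human-facing outputs. How the actual model formats or interleaves these outputs is not specified in the abstract.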
