Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. Using natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation supports three application directions: synthetic data generation to augment policy training, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. The design has three parts: (a) a Double-Stream MMDiT with MLLM Action Encoding, in which a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; (b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mappings covering 20+ embodiments and 500+ action categories; and (c) a General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Qwen-RobotWorld ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
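
To make the phrase "layer-wise joint attention" concrete, the following is a minimal sketch of one joint-attention step over two token streams, written in plain Python. It is an illustration under stated assumptions, not the report's implementation: the function names (`joint_attention`, `random_matrix`), the single attention head, the tiny dimensions, and the random weights are all hypothetical, and the sketch omits residual connections, normalization, timestep conditioning, and the stacking of 60 layers. The only idea it is meant to show is that each stream keeps its own projection weights while attention is computed over the concatenated text and video tokens.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, video_tokens, text_w, video_w):
    """One joint-attention step (illustrative assumption, single head).

    text_w and video_w are (w_q, w_k, w_v) tuples, one set per stream,
    so the two streams do not share projection weights.
    """
    n_text = len(text_tokens)
    q, k, v = [], [], []
    # Project each stream with its own weights, then concatenate along tokens.
    for tokens, (w_q, w_k, w_v) in ((text_tokens, text_w), (video_tokens, video_w)):
        q += matmul(tokens, w_q)
        k += matmul(tokens, w_k)
        v += matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (dim x tokens)
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # every token attends to both streams
    out = matmul(weights, v)
    # Split the joint output back into the two streams.
    return out[:n_text], out[n_text:]


rng = random.Random(0)
dim = 4
text = random_matrix(3, dim, rng)   # stand-in for language-encoder features
video = random_matrix(5, dim, rng)  # stand-in for video-VAE latent tokens
text_w = tuple(random_matrix(dim, dim, rng) for _ in range(3))
video_w = tuple(random_matrix(dim, dim, rng) for _ in range(3))
text_out, video_out = joint_attention(text, video, text_w, video_w)
print(len(text_out), len(video_out))  # 3 5
```

Splitting the output back into two streams is what lets each stream apply its own subsequent layers; whether and how the text stream is updated in Qwen-RobotWorld is not stated in the abstract, so that detail is left open here.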