Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling via Language-Conditioned Video Generation

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. Using natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation opens three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. The model rests on a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, in which a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive experiments show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
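
To make the phrase "layer-wise joint attention" concrete, the following is a minimal single-head sketch of how one double-stream block could let language tokens and video latent tokens attend to each other. It is an illustration under stated assumptions, not the report's implementation: the per-stream projection matrices, the token counts, the feature width, and all function names (`joint_attention`, `random_matrix`, and so on) are hypothetical, and multi-head attention, normalization, timestep modulation, and feed-forward layers are omitted.

```python
import math
import random


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, video_tokens, text_proj, video_proj):
    """Single-head joint attention over two token streams.

    Assumption: each stream keeps its own Q/K/V projections (the
    "double-stream" part), while attention runs over the concatenated
    sequence so that every token can attend to both streams.
    """
    q = matmul(text_tokens, text_proj["q"]) + matmul(video_tokens, video_proj["q"])
    k = matmul(text_tokens, text_proj["k"]) + matmul(video_tokens, video_proj["k"])
    v = matmul(text_tokens, text_proj["v"]) + matmul(video_tokens, video_proj["v"])

    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    mixed = matmul(weights, v)

    # Split the mixed sequence back into the two streams.
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:], weights


# Toy usage: 3 language tokens and 5 video latent tokens, width 4.
rng = random.Random(0)
dim = 4
text = random_matrix(3, dim, rng)
video = random_matrix(5, dim, rng)
text_proj = {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}
video_proj = {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}

text_out, video_out, weights = joint_attention(text, video, text_proj, video_proj)
```

In this sketch, freezing the language encoder would correspond to treating `text` as fixed inputs while only the projection weights are trained. Stacking such blocks gives the layer-wise coupling the abstract describes.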