Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

June 15, 2026
Authors: Jie Zhang, Xiaoyue Chen, Anzhe Chen, Chenxu Lv, Deqing Li, Gengze Zhou, Hang Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zhixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Xiong-Hui Chen, Chenfei Wu
cs.AI

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation opens three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. The model rests on a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
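
To make the "double-stream ... joint attention" idea in part a) more concrete, here is a minimal sketch of one joint-attention step in the general MMDiT style: each stream keeps its own query/key/value projections, and attention runs over the concatenated token sequence so that language and video tokens can attend to each other. This is not taken from the report. It is single-head, omits normalization, MLPs, and timestep modulation, and uses random toy matrices in place of Qwen2.5-VL features and video-VAE latents. All names (`joint_attention`, `random_matrix`, the toy sizes) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or encoder features (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(text_tokens, video_tokens, text_proj, video_proj):
    """One single-head joint-attention step over two token streams.

    Each stream has its own (W_q, W_k, W_v) projections; attention itself
    is computed over the concatenation of both streams.
    """
    def project(tokens, proj):
        w_q, w_k, w_v = proj
        return matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)

    tq, tk, tv = project(text_tokens, text_proj)
    vq, vk, vv = project(video_tokens, video_proj)

    # Concatenate along the sequence axis: text tokens first, then video tokens.
    q, k, v = tq + vq, tk + vk, tv + vv

    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    mixed = matmul(weights, v)

    # Split the result back into the two streams.
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:], weights


rng = random.Random(0)
dim = 4                                  # toy hidden size (assumption)
text = random_matrix(3, dim, rng)        # 3 language tokens
video = random_matrix(5, dim, rng)       # 5 video latent tokens
text_proj = tuple(random_matrix(dim, dim, rng) for _ in range(3))
video_proj = tuple(random_matrix(dim, dim, rng) for _ in range(3))

text_out, video_out, weights = joint_attention(text, video, text_proj, video_proj)
print(len(text_out), len(video_out), len(weights[0]))
```

Each output token is a weighted mix of all eight tokens from both streams, which is the sense in which the two streams are coupled. A real model of this kind would stack many such layers and add the omitted components.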