Qwen-RobotWorld技术报告：通过语言条件视频生成统一具身世界建模

摘要

我们提出Qwen-RobotWorld，这是一个面向具身智能的基于语言条件的视频世界模型。该模型以自然语言作为统一动作接口，能够从当前观测中预测物理上合理的未来视觉轨迹，涵盖机器人操作、自动驾驶、室内导航以及人机转移等场景。这种统一公式提供了三个有前景的应用方向：用于策略训练增强的合成数据生成、用于策略评估的可扩展虚拟环境，以及用于下游机器人控制的语言引导规划信号。这是通过三部分设计实现的：a) 双流MMDiT与MLLM动作编码，其中60层双流扩散变压器通过逐层联合注意力将冻结的Qwen2.5-VL语义与视频VAE潜变量耦合；b) 具身世界知识（EWK），一个包含860万视频文本语料库（超过2亿帧）的数据集，具有超过20种具身形态和500多个动作类别的动作语言映射；c) 通用+专家渐进课程，这是一种两阶段训练策略，首先学习通用视觉先验，然后在共享语言接口下注入具身专门化。大量结果表明其具有很强的竞争力：在EWMBench和DreamGen Bench上总体排名第一，在WorldModelBench和PBench上优于所有开源模型。在RoboTwin-IF基准上的额外零样本分析进一步支持了强大的泛化能力和多视图一致性。

English

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.