DreamX-World 1.0: A General-Purpose Interactive World Model

June 15, 2026
Authors: DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu
cs.AI

Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45) in overall score.
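
The abstract does not spell out how camera-geometry-based retrieval selects earlier views, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes each stored view keeps a camera position and a unit viewing direction, and it ranks stored views by how closely their cameras match the current one. The names `CameraPose`, `pose_score`, `retrieve_views`, and the `distance_scale` parameter are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraPose:
    """Hypothetical pose record: world-space position and unit viewing direction."""
    position: tuple[float, float, float]
    forward: tuple[float, float, float]


def _dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def pose_score(query: CameraPose, stored: CameraPose, distance_scale: float = 5.0) -> float:
    """Assumed similarity: viewing-direction agreement damped by camera distance."""
    # Clamp at zero so cameras facing away from each other never match.
    alignment = max(0.0, _dot(query.forward, stored.forward))
    distance = math.dist(query.position, stored.position)
    return alignment * math.exp(-distance / distance_scale)


def retrieve_views(query: CameraPose, memory: list[CameraPose], k: int = 2) -> list[int]:
    """Return indices of the k stored views whose cameras best match the query."""
    ranked = sorted(
        range(len(memory)),
        key=lambda i: pose_score(query, memory[i]),
        reverse=True,
    )
    return ranked[:k]


memory = [
    CameraPose((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # start, looking forward
    CameraPose((0.0, 0.0, 4.0), (1.0, 0.0, 0.0)),   # moved ahead, looking right
    CameraPose((3.0, 0.0, 4.0), (0.0, 0.0, -1.0)),  # further along, looking back
]

# The camera returns near the start and looks forward again (a revisit).
query = CameraPose((0.5, 0.0, 0.2), (0.0, 0.0, 1.0))
print(retrieve_views(query, memory, k=1))  # prints [0]
```

In this toy setup the revisit retrieves the first stored view, since it is both nearby and facing the same way. A real system would more plausibly score overlap between full view frustums using camera intrinsics and extrinsics; the position-plus-direction score here is a simplification chosen to keep the example short.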