DreamX-World 1.0: A General-Purpose Interactive World Model

Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence restores earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45) in overall score.
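To make the idea of camera-geometry-based retrieval concrete, the following is a minimal sketch and not the method used in DreamX-World 1.0, whose retrieval rule is not specified in this abstract. It assumes each stored view is summarized by a camera position and a forward direction, and it ranks stored views by a hypothetical similarity score (direction agreement damped by distance). The names `view_similarity`, `retrieve_memory`, and `dist_scale` are illustrative assumptions.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def view_similarity(pos_a, fwd_a, pos_b, fwd_b, dist_scale=5.0):
    """Hypothetical score in [0, 1]: cosine between the two viewing
    directions (clamped at 0), damped exponentially by camera distance."""
    cos_angle = sum(a * b for a, b in zip(normalize(fwd_a), normalize(fwd_b)))
    dist = math.dist(pos_a, pos_b)
    return max(0.0, cos_angle) * math.exp(-dist / dist_scale)


def retrieve_memory(query, memory, k=2):
    """Return the indices of the k stored views most similar to the query pose.

    query:  (position, forward) pair for the current camera.
    memory: list of (position, forward) pairs for earlier views.
    """
    q_pos, q_fwd = query
    scored = [
        (view_similarity(q_pos, q_fwd, pos, fwd), i)
        for i, (pos, fwd) in enumerate(memory)
    ]
    # Highest score first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy memory bank (assumed data): four earlier camera poses.
memory = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),    # 0: near the query, same direction
    ((0.0, 0.0, 4.0), (0.0, 0.0, 1.0)),    # 1: farther ahead, same direction
    ((10.0, 0.0, 0.0), (1.0, 0.0, 0.0)),   # 2: far away, looking sideways
    ((0.5, 0.0, 0.5), (0.0, 0.0, -1.0)),   # 3: nearby but facing backwards
]
query = ((0.2, 0.0, 0.1), (0.0, 0.0, 1.0))
top_views = retrieve_memory(query, memory, k=2)
```

In this toy setup the two views that face the same way as the query are selected, with the closer one ranked first. A real system would more plausibly score frustum or field-of-view overlap using full intrinsics and extrinsics rather than this simplified pose heuristic.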