DreamX-World 1.0: A General-Purpose Interactive World Model

Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path more robust to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers the camera control and visual quality lost during distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, exceeding the overall scores of HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45).
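
The abstract does not say how camera-geometry-based retrieval scores earlier views, so the following is only a minimal sketch of the general idea, not the DreamX-World implementation. It assumes each stored view keeps a camera position and a unit viewing direction, and it ranks views with a hypothetical score that rewards nearby cameras looking in a similar direction. The names `MemoryEntry`, `pose_score`, `retrieve`, and the parameter `distance_scale` are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    frame_id: int
    position: tuple[float, float, float]  # camera center in world coordinates
    forward: tuple[float, float, float]   # unit viewing direction


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pose_score(query, entry, distance_scale=1.0):
    """Hypothetical similarity: high when the two cameras are close together
    and look the same way. This scoring rule is an assumption."""
    dist = math.dist(query.position, entry.position)
    alignment = _dot(query.forward, entry.forward)  # cosine, since both are unit vectors
    return alignment * math.exp(-dist / distance_scale)


def retrieve(query, memory, k=2):
    """Return the frame ids of the k stored views that best match the query pose."""
    ranked = sorted(memory, key=lambda e: pose_score(query, e), reverse=True)
    return [e.frame_id for e in ranked[:k]]


memory = [
    MemoryEntry(0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # near, same direction
    MemoryEntry(1, (5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # far, same direction
    MemoryEntry(2, (0.2, 0.0, 0.0), (0.0, 0.0, -1.0)),  # near, opposite direction
    MemoryEntry(3, (0.5, 0.0, 0.0), (0.0, 0.0, 1.0)),   # fairly near, same direction
]
query = MemoryEntry(-1, (0.1, 0.0, 0.0), (0.0, 0.0, 1.0))
print(retrieve(query, memory))  # [0, 3]
```

In this toy setup the nearby view facing the opposite way is ranked last even though its camera is close, which illustrates why a retrieval rule based on camera geometry looks at viewing direction as well as position. A real system would more likely use full intrinsics and extrinsics, for example by estimating frustum overlap.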