HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Abstract

We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project, HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: (a) panorama generation with HY-Pano 2.0, (b) trajectory planning with WorldNav, (c) world expansion with WorldStereo 2.0, and (d) world composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining its model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. In addition, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance among open-source approaches on several benchmarks, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
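The ordering of the four-stage generation method can be illustrated with a minimal sketch. It shows only the data flow implied by the abstract (panorama, then trajectory, then expansion, then composition). Every name in it (`WorldState`, `make_stage`, `generate_world`, the stage labels) is hypothetical and chosen for illustration; none of it is the released HY-World 2.0 API, and the stages are stand-ins that record their own execution rather than run any model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorldState:
    """Hypothetical container for the products accumulated across stages."""
    prompt: str
    artifacts: List[str] = field(default_factory=list)


def make_stage(name: str) -> Callable[[WorldState], WorldState]:
    """Build a stand-in stage that only records that it ran (no model is called)."""
    def stage(state: WorldState) -> WorldState:
        # Return a new state so each stage's input is left unmodified.
        return WorldState(state.prompt, state.artifacts + [name])
    return stage


# Stage order as listed in the abstract; the callables are placeholders.
STAGES = [
    make_stage("panorama generation (HY-Pano 2.0)"),
    make_stage("trajectory planning (WorldNav)"),
    make_stage("world expansion (WorldStereo 2.0)"),
    make_stage("world composition (WorldMirror 2.0)"),
]


def generate_world(prompt: str) -> WorldState:
    """Run the placeholder stages in sequence, threading the state through."""
    state = WorldState(prompt)
    for stage in STAGES:
        state = stage(state)
    return state


result = generate_world("a quiet mountain village")
print(" -> ".join(result.artifacts))
```

In the actual system each stage would consume the previous stage's output (for example, the planned trajectories would presumably condition the view generation), which the single threaded `WorldState` object only approximates.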

