HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
April 15, 2026
Authors: Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo
cs.AI
Abstract
We introduce HY-World 2.0, a multi-modal world model framework that substantially upgrades our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. Given text or a single-view image, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes through a four-stage pipeline: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to improve panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model, with a consistent memory mechanism. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining its architecture and learning strategy, enabling world reconstruction from multi-view images or videos. In addition, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible, engine-agnostic architecture, automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character interaction. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.
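The four-stage flow above can be sketched as a simple sequential pipeline. This is a minimal illustrative sketch only: every function name and return value here is a hypothetical stand-in, not the released HY-World 2.0 API.

```python
# Hypothetical sketch of the HY-World 2.0 generation pipeline.
# All names and data shapes are illustrative assumptions, not the real API.

def generate_panorama(prompt):
    # Stage a) HY-Pano 2.0: text or single-view image -> 360-degree panorama (stub).
    return {"panorama_for": prompt}

def plan_trajectory(panorama):
    # Stage b) WorldNav: 3D scene understanding -> navigable keyframe trajectory (stub).
    return ["keyframe_0", "keyframe_1"]

def expand_world(panorama, trajectory):
    # Stage c) WorldStereo 2.0: keyframe-based view generation with consistent
    # memory, expanding the scene along the planned trajectory (stub).
    return [f"views_at_{k}" for k in trajectory]

def compose_world(views):
    # Stage d) WorldMirror 2.0: feed-forward fusion of generated views into a
    # single 3D Gaussian Splatting scene (stub).
    return {"gaussian_scene": True, "num_view_sets": len(views)}

def hy_world_generate(prompt):
    # Chain the four stages: panorama -> trajectory -> expansion -> composition.
    panorama = generate_panorama(prompt)
    trajectory = plan_trajectory(panorama)
    views = expand_world(panorama, trajectory)
    return compose_world(views)
```

The point of the sketch is the data flow: each stage consumes the previous stage's output, so the panorama anchors trajectory planning, and the planned keyframes drive view expansion before final 3DGS composition.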