HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
July 29, 2025
Authors: HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo
cs.AI
Abstract
Creating immersive and playable 3D worlds from texts or images remains a
fundamental challenge in computer vision and graphics. Existing world
generation approaches typically fall into two categories: video-based methods
that offer rich diversity but lack 3D consistency and rendering efficiency, and
3D-based methods that provide geometric consistency but struggle with limited
training data and memory-inefficient representations. To address these
limitations, we present HunyuanWorld 1.0, a novel framework that combines the
best of both worlds for generating immersive, explorable, and interactive 3D
scenes from text and image conditions. Our approach features three key
advantages: 1) 360° immersive experiences via panoramic world proxies; 2)
mesh export capabilities for seamless compatibility with existing computer
graphics pipelines; 3) disentangled object representations for augmented
interactivity. The core of our framework is a semantically layered 3D mesh
representation that leverages panoramic images as 360° world proxies for
semantic-aware world decomposition and reconstruction, enabling the generation
of diverse 3D worlds. Extensive experiments demonstrate that our method
achieves state-of-the-art performance in generating coherent, explorable, and
interactive 3D worlds while enabling versatile applications in virtual reality,
physical simulation, game development, and interactive content creation.
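The abstract's central idea, using a panoramic image as a 360° world proxy that is decomposed into layered 3D meshes, rests on a standard geometric step: unprojecting an equirectangular panorama (given per-pixel depth) into 3D points on the sphere of view. The sketch below illustrates only that generic geometry; the function name, the toy depth map, and the optional per-layer mask are illustrative assumptions, not code or parameters from the HunyuanWorld 1.0 release.

```python
import numpy as np

def panorama_to_points(depth, mask=None):
    """Unproject an equirectangular depth map into 3D points.

    Each pixel center maps to spherical angles (longitude, latitude);
    scaling the unit viewing ray by depth yields a 3D point. A boolean
    mask can select one semantic layer (e.g. foreground objects),
    mimicking a layered decomposition. Names here are illustrative.
    """
    h, w = depth.shape
    # Pixel centers -> longitude in [-pi, pi), latitude in [-pi/2, pi/2]
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (h, w)
    # Unit ray directions on the viewing sphere (y up)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    pts = np.stack([x, y, z], axis=-1) * depth[..., None]
    if mask is not None:
        pts = pts[mask]  # keep only this layer's pixels
    return pts

# Toy example: a constant-depth panorama 2 m away on all sides
depth = np.full((64, 128), 2.0)
pts = panorama_to_points(depth)          # shape (64, 128, 3)
radii = np.linalg.norm(pts, axis=-1)     # all radii equal the depth, 2.0
```

Connecting neighboring unprojected points into triangles is what turns such a proxy into an exportable mesh, which is why a panoramic representation composes naturally with standard graphics pipelines.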