HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

Abstract

Creating immersive and playable 3D worlds from text or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the strengths of both approaches to generate immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
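
To make the idea of a panorama acting as a 360° world proxy concrete, the following is a minimal sketch of how a panoramic image with per-pixel depth could be lifted into a mesh. It is not the HunyuanWorld implementation: the abstract does not specify the projection or the meshing scheme, so the equirectangular layout, the axis convention (+y up, +z forward), and the helper names `pixel_to_direction`, `unproject`, and `grid_faces` are all assumptions made for illustration.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction.

    Assumed convention (illustrative, not from the paper): longitude runs
    from -pi to pi left to right, latitude from +pi/2 to -pi/2 top to
    bottom, +y is up and +z is forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def unproject(depth_map):
    """Lift a depth map (list of rows, distance along each ray) to 3D points."""
    height = len(depth_map)
    width = len(depth_map[0])
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = pixel_to_direction(u, v, width, height)
            d = depth_map[v][u]
            points.append((d * dx, d * dy, d * dz))
    return points


def grid_faces(width, height):
    """Triangulate the pixel grid, wrapping columns to close the 360° seam."""
    faces = []
    for v in range(height - 1):
        for u in range(width):
            u_next = (u + 1) % width  # last column connects back to the first
            a = v * width + u
            b = v * width + u_next
            c = (v + 1) * width + u
            d = (v + 1) * width + u_next
            faces.append((a, c, b))
            faces.append((b, c, d))
    return faces


if __name__ == "__main__":
    # Toy 8x4 panorama in which every pixel is 2 units from the viewer.
    depth = [[2.0] * 8 for _ in range(4)]
    vertices = unproject(depth)
    triangles = grid_faces(8, 4)
    print(len(vertices), "vertices,", len(triangles), "triangles")
```

Under these assumptions, one such mesh could be built per semantic layer (for example sky, background, and individual foreground objects), which is one way to read the layered, disentangled representation described above. The vertex and face lists map directly onto common mesh formats used by graphics pipelines.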