ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
June 17, 2025
Authors: Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma
cs.AI
Abstract
Automatic creation of 3D scenes for immersive VR presence has been a
significant research focus for decades. However, existing methods often rely on
either high-poly mesh modeling with post-hoc simplification or massive 3D
Gaussians, resulting in a complex pipeline or limited visual realism. In this
paper, we demonstrate that such exhaustive modeling is unnecessary for
achieving a compelling immersive experience. We introduce ImmerseGen, a novel
agent-guided framework for compact and photorealistic world modeling.
ImmerseGen represents scenes as hierarchical compositions of lightweight
geometric proxies, i.e., simplified terrain and billboard meshes, and generates
photorealistic appearance by synthesizing RGBA textures onto these proxies.
Specifically, we propose terrain-conditioned texturing for user-centric base
world synthesis, and RGBA asset texturing for midground and foreground scenery.
This reformulation offers several advantages: (i) it simplifies modeling by
enabling agents to guide generative models in producing coherent textures that
integrate seamlessly with the scene; (ii) it bypasses complex geometry creation
and decimation by directly synthesizing photorealistic textures on proxies,
preserving visual quality without degradation; (iii) it enables compact
representations suitable for real-time rendering on mobile VR headsets. To
automate scene creation from text prompts, we introduce VLM-based modeling
agents enhanced with semantic grid-based analysis for improved spatial
reasoning and accurate asset placement. ImmerseGen further enriches scenes with
dynamic effects and ambient audio to support multisensory immersion.
Experiments on scene generation and live VR showcases demonstrate that
ImmerseGen achieves superior photorealism, spatial coherence, and rendering
efficiency compared to prior methods. Project webpage:
https://immersegen.github.io.
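
To make the proxy representation concrete, here is a minimal sketch of the billboard idea described in the abstract: an asset is reduced to a single quad whose RGBA texture carries both appearance and silhouette (via alpha), and the scene is formed by a back-to-front "over" composite. The class and function names, the full-frame composite (which omits projecting each quad to its screen footprint), and the toy texture are illustrative assumptions, not the paper's implementation.

```python
# Sketch: alpha-textured billboard proxies composited back-to-front.
# All names here are hypothetical; the alpha channel stands in for
# detailed geometry by cutting out the asset's silhouette.
from dataclasses import dataclass
import numpy as np

@dataclass
class BillboardProxy:
    depth: float      # distance from the viewer, used for back-to-front sorting
    rgba: np.ndarray  # (H, W, 4) float texture in [0, 1]; alpha defines the silhouette

def composite(background: np.ndarray, proxies: list[BillboardProxy]) -> np.ndarray:
    """Alpha-composite billboards over a background, farthest first ("over" operator).

    Toy version: every billboard covers the whole frame; a real renderer would
    first rasterize each quad into its projected screen region.
    """
    out = background.copy()
    for p in sorted(proxies, key=lambda b: -b.depth):
        rgb, a = p.rgba[..., :3], p.rgba[..., 3:4]
        out = rgb * a + out * (1.0 - a)
    return out

# Toy usage: one mostly-opaque red billboard over a gray background.
bg = np.full((4, 4, 3), 0.5)
tree = BillboardProxy(depth=10.0, rgba=np.concatenate(
    [np.tile([1.0, 0.0, 0.0], (4, 4, 1)), np.full((4, 4, 1), 0.8)], axis=-1))
print(composite(bg, [tree])[0, 0])  # -> [0.9, 0.1, 0.1], a blend of red and gray
```

Because each proxy is one quad plus one texture, the per-asset cost is a handful of vertices and a single alpha-tested draw, which is consistent with the abstract's claim of real-time rendering on mobile VR headsets.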
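The "semantic grid-based analysis" used by the VLM modeling agents can be illustrated in a similar spirit: discretize the terrain into labeled cells and restrict each asset to cells whose label matches its affinity. The grid, labels, and affinity table below are invented for illustration and are not the paper's data or API.

```python
# Hypothetical sketch of semantic grid-based asset placement: an agent
# samples a placement cell only from cells whose semantic label is valid
# for the asset, turning free-form spatial reasoning into a constrained lookup.
import random

semantic_grid = [
    ["water", "water", "grass", "grass"],
    ["water", "grass", "grass", "rock"],
    ["grass", "grass", "rock",  "rock"],
]
asset_affinity = {"boat": {"water"}, "pine_tree": {"grass"}, "boulder": {"rock"}}

def place_asset(asset: str, rng: random.Random) -> tuple[int, int]:
    """Return a (row, col) cell whose semantic label is valid for the asset."""
    valid = [(r, c)
             for r, row in enumerate(semantic_grid)
             for c, label in enumerate(row)
             if label in asset_affinity[asset]]
    return rng.choice(valid)

rng = random.Random(0)
print(place_asset("boat", rng))       # lands on a water cell
print(place_asset("pine_tree", rng))  # lands on a grass cell
```

One plausible reason such a grid improves placement accuracy, as the abstract suggests, is that it converts the VLM's open-ended spatial judgments into discrete, checkable constraints over labeled cells.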