ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
June 17, 2025
Authors: Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, Yuewen Ma
cs.AI
Abstract
Automatic creation of 3D scenes for immersive VR presence has been a
significant research focus for decades. However, existing methods often rely on
either high-poly mesh modeling with post-hoc simplification or massive sets of
3D Gaussians, resulting in complex pipelines or limited visual realism. In this
paper, we demonstrate that such exhaustive modeling is unnecessary for
achieving a compelling immersive experience. We introduce ImmerseGen, a novel
agent-guided framework for compact and photorealistic world modeling.
ImmerseGen represents scenes as hierarchical compositions of lightweight
geometric proxies, i.e., simplified terrain and billboard meshes, and generates
photorealistic appearance by synthesizing RGBA textures onto these proxies.
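
To make this representation concrete, here is a minimal sketch of such a proxy
hierarchy; the class and field names are illustrative assumptions, as the paper
does not publish its actual data structures.

```python
# A minimal sketch of the proxy-based scene representation, assuming
# hypothetical class and field names (not the paper's actual structures).
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Proxy:
    vertices: np.ndarray      # (V, 3) low-poly proxy geometry
    faces: np.ndarray         # (F, 3) triangle indices
    rgba_texture: np.ndarray  # (H, W, 4); alpha cuts out silhouettes on billboards


@dataclass
class Scene:
    base_terrain: Proxy                                    # user-centric base world
    midground: list[Proxy] = field(default_factory=list)   # billboard scenery
    foreground: list[Proxy] = field(default_factory=list)  # near-field assets
```
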
Specifically, we propose terrain-conditioned texturing for user-centric base
world synthesis, and RGBA asset texturing for midground and foreground scenery.
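
One way to picture terrain-conditioned texturing is with a depth-conditioned
diffusion model. The sketch below is a hedged assumption, not the paper's
method: it swaps in a public ControlNet checkpoint and treats a depth render of
the terrain proxy as the conditioning signal that keeps the generated texture
aligned with the simplified geometry.

```python
# Hedged sketch of terrain-conditioned texturing. The paper's actual texture
# generator is unspecified; this assumes an off-the-shelf depth-conditioned
# diffusion model (ControlNet) guided by a depth render of the terrain proxy.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")


def texture_terrain(depth_render, prompt):
    """Synthesize a texture aligned to the simplified terrain proxy.

    depth_render: PIL image of the proxy's depth from the user-centric
    viewpoint; conditioning on it keeps the texture consistent with geometry.
    """
    return pipe(prompt, image=depth_render, num_inference_steps=30).images[0]
```
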
This reformulation offers several advantages: (i) it simplifies modeling by
enabling agents to guide generative models in producing coherent textures that
integrate seamlessly with the scene; (ii) it bypasses complex geometry creation
and decimation by directly synthesizing photorealistic textures on proxies,
preserving visual quality without degradation; (iii) it enables compact
representations suitable for real-time rendering on mobile VR headsets. To
automate scene creation from text prompts, we introduce VLM-based modeling
agents enhanced with semantic grid-based analysis for improved spatial
reasoning and accurate asset placement. ImmerseGen further enriches scenes with
dynamic effects and ambient audio to support multisensory immersion.
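
The semantic grid-based analysis can be sketched as a discretized region map
that the agent queries once per asset. Everything below (the grid encoding and
the query_vlm callable) is an illustrative assumption, not the paper's
interface.

```python
# Illustrative sketch of semantic grid-based asset placement. The abstract only
# states that VLM agents use a semantic grid for spatial reasoning; the grid
# encoding and the `query_vlm` callable here are hypothetical.
import numpy as np


def place_assets(semantic_grid, asset_prompts, query_vlm, rng=None):
    """semantic_grid: (N, N) int array of region labels (e.g. 0=water, 1=meadow).
    query_vlm: callable that returns the label best matching an asset prompt."""
    rng = rng or np.random.default_rng()
    placements = {}
    labels = np.unique(semantic_grid).tolist()
    for prompt in asset_prompts:
        label = query_vlm(f"Pick the best region label for '{prompt}' from {labels}")
        cells = np.argwhere(semantic_grid == label)
        if len(cells):
            # Place the asset at one cell of the region the agent chose.
            placements[prompt] = tuple(cells[rng.integers(len(cells))])
    return placements
```
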
Experiments on scene generation and live VR showcases demonstrate that
ImmerseGen achieves superior photorealism, spatial coherence and rendering
efficiency compared to prior methods. Project webpage:
https://immersegen.github.io.