Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
May 5, 2025
Authors: Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li
cs.AI
Abstract
Synthesizing interactive 3D scenes from text is essential for gaming, virtual
reality, and embodied AI. However, existing methods face several challenges.
Learning-based approaches depend on small-scale indoor datasets, limiting the
scene diversity and layout complexity. While large language models (LLMs) can
leverage diverse text-domain knowledge, they struggle with spatial realism,
often producing unnatural object placements that fail to respect common sense.
Our key insight is that vision perception can bridge this gap by providing
realistic spatial guidance that LLMs lack. To this end, we introduce
Scenethesis, a training-free agentic framework that integrates LLM-based scene
planning with vision-guided layout refinement. Given a text prompt, Scenethesis
first employs an LLM to draft a coarse layout. A vision module then refines it
by generating image guidance and extracting scene structure to capture
inter-object relations. Next, an optimization module iteratively enforces
accurate pose alignment and physical plausibility, preventing artifacts like
object penetration and instability. Finally, a judge module verifies spatial
coherence. Comprehensive experiments show that Scenethesis generates diverse,
realistic, and physically plausible 3D interactive scenes, making it valuable
for virtual content creation, simulation environments, and embodied AI
research.
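The abstract names object penetration as one artifact the optimization module prevents, but it does not say how penetration is detected. As a minimal illustration of the idea only, and not the authors' method, the sketch below flags interpenetrating objects using axis-aligned bounding boxes. The `Box` class, the function names, the tolerance, and the toy scene are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Box:
    """Axis-aligned bounding box: a hypothetical stand-in for a placed object."""
    name: str
    min_corner: tuple[float, float, float]
    max_corner: tuple[float, float, float]


def penetration_depth(a: Box, b: Box) -> float:
    """Return the smallest per-axis overlap, or 0.0 if the boxes do not interpenetrate."""
    overlaps = [
        min(a.max_corner[i], b.max_corner[i]) - max(a.min_corner[i], b.min_corner[i])
        for i in range(3)
    ]
    # Two boxes interpenetrate only if they overlap on all three axes;
    # touching faces (zero overlap) count as contact, not penetration.
    return min(overlaps) if all(o > 0 for o in overlaps) else 0.0


def find_penetrations(boxes, tol=1e-6):
    """List (name_a, name_b, depth) for every pair overlapping by more than tol."""
    hits = []
    for a, b in combinations(boxes, 2):
        depth = penetration_depth(a, b)
        if depth > tol:
            hits.append((a.name, b.name, depth))
    return hits


# Toy scene (assumed values, in meters).
scene = [
    Box("table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.8)),
    Box("lamp", (0.4, 0.4, 0.8), (0.6, 0.6, 1.2)),   # rests on the table top
    Box("chair", (0.9, 0.2, 0.0), (1.4, 0.7, 0.9)),  # clips into the table
]

hits = find_penetrations(scene)
for name_a, name_b, depth in hits:
    print(f"{name_a} penetrates {name_b} by about {depth:.2f} m")
```

In this toy scene only the table and chair pair is reported. The lamp merely touches the table top, so it counts as supported contact. An iterative optimizer could use such depths as a penalty term to push objects apart, though the abstract does not confirm that Scenethesis works this way.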