Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Abstract

Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, which limits scene diversity and layout complexity. Large language models (LLMs) can draw on diverse text-domain knowledge, but they struggle with spatial realism and often produce unnatural object placements that violate common sense. Our key insight is that visual perception can bridge this gap by providing the realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines the layout by generating image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts such as object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible interactive 3D scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
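The four stages described above (plan, refine, optimize, judge) can be pictured with a minimal sketch. This is not the authors' implementation: the `Box` type, the function names, and the stand-in stages are all hypothetical, and the optimizer shown here is a deliberately simple axis-aligned bounding-box push-apart step used only to illustrate what "preventing object penetration" could mean in code. The real planner, vision module, and judge would be model calls, which are replaced here by plain Python callables.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

EPS = 1e-9  # tolerance so that boxes that merely touch do not count as penetrating


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a placed object: an axis-aligned box."""
    name: str
    center: Tuple[float, float, float]  # (x, y, z)
    size: Tuple[float, float, float]    # full extent along x, y, z


def overlap_depth(a: Box, b: Box) -> Tuple[float, ...]:
    """Per-axis overlap; the boxes interpenetrate only if all three are positive."""
    return tuple(
        (a.size[i] + b.size[i]) / 2 - abs(a.center[i] - b.center[i])
        for i in range(3)
    )


def penetrates(a: Box, b: Box) -> bool:
    return all(d > EPS for d in overlap_depth(a, b))


def resolve_penetrations(boxes: List[Box], max_iters: int = 50) -> List[Box]:
    """Toy optimizer (an assumption, not the paper's method): slide the later
    box of each penetrating pair along the horizontal axis of least overlap."""
    boxes = list(boxes)
    for _ in range(max_iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if not penetrates(a, b):
                    continue
                depth = overlap_depth(a, b)
                axis = 0 if depth[0] <= depth[1] else 1  # x or y only; keep z fixed
                sign = 1.0 if b.center[axis] >= a.center[axis] else -1.0
                center = list(b.center)
                center[axis] += sign * depth[axis]
                boxes[j] = replace(b, center=tuple(center))
                moved = True
        if not moved:
            break
    return boxes


Layout = List[Box]


def generate_scene(
    prompt: str,
    plan: Callable[[str], Layout],            # stands in for the LLM planner
    refine: Callable[[str, Layout], Layout],  # stands in for the vision module
    optimize: Callable[[Layout], Layout],     # physical-plausibility step
    judge: Callable[[str, Layout], bool],     # spatial-coherence check
) -> Tuple[Layout, bool]:
    """Run the four stages in order and return the layout with the judge's verdict."""
    layout = optimize(refine(prompt, plan(prompt)))
    return layout, judge(prompt, layout)


# Stand-in stages: a fixed coarse layout in which the chair sits inside the table.
def toy_plan(prompt: str) -> Layout:
    return [
        Box("table", (0.0, 0.0, 0.4), (1.0, 1.0, 0.8)),
        Box("chair", (0.3, 0.0, 0.45), (0.5, 0.5, 0.9)),
    ]


def toy_refine(prompt: str, layout: Layout) -> Layout:
    return layout  # a real vision module would adjust poses from image guidance


def toy_judge(prompt: str, layout: Layout) -> bool:
    return not any(
        penetrates(a, b) for i, a in enumerate(layout) for b in layout[i + 1:]
    )


layout, ok = generate_scene(
    "a dining corner", toy_plan, toy_refine, resolve_penetrations, toy_judge
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or physics backend. In this toy run the chair is slid along the x axis until it only touches the table, after which the stand-in judge accepts the layout.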