Constructing a 3D Town from a Single Image

Abstract

Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. A lightweight alternative, generating complex 3D scenes from a single top-down image, therefore plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, extending them to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatially aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each region with a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
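
The two steps named in the abstract, overlapping region decomposition and masked rectified flow inpainting, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`overlapping_regions`, `masked_flow_inpaint`, `toy_velocity`), the flat-list "latent", the time convention (t = 0 is noise, t = 1 is data), and the way known entries are pinned to a straight noise-to-data path are all assumptions made for illustration. A real system would replace `toy_velocity` with the velocity predicted by a pretrained 3D generator and operate on 3D latents.

```python
import random


def overlapping_regions(width, height, tile, overlap):
    """Return (x0, y0, x1, y1) boxes that cover a width x height image.

    Neighbouring boxes share at least `overlap` pixels along each axis.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(length):
        if length <= tile:
            return [0]
        s = list(range(0, length - tile, stride))
        s.append(length - tile)  # last tile sits flush with the border
        return s

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def masked_flow_inpaint(known, mask, velocity, steps=10, seed=0):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data).

    Where mask[i] is True, x[i] is reset after every step to the straight
    line between its noise sample and known[i], so only the entries with
    mask[i] == False are actually generated. (Assumed masking scheme.)
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in known]
    x = list(noise)
    for k in range(steps):
        t = k / steps
        v = velocity(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]  # Euler step, dt = 1/steps
        t_next = (k + 1) / steps
        x = [(1.0 - t_next) * n + t_next * kn if m else xi
             for xi, n, kn, m in zip(x, noise, known, mask)]
    return x


def toy_velocity(x, t, target=0.5):
    """Hypothetical stand-in for a learned velocity field.

    It is the exact rectified-flow velocity when all data equals `target`.
    """
    return [(target - xi) / (1.0 - t) for xi in x]


if __name__ == "__main__":
    boxes = overlapping_regions(10, 10, tile=4, overlap=2)
    print(len(boxes), boxes[0], boxes[-1])
    latent = masked_flow_inpaint([1.0, 0.0, 2.0, 0.0],
                                 [True, False, True, False],
                                 toy_velocity)
    print(latent)
```

In this toy run, the masked entries finish exactly at their known values, while the unmasked entries are driven by the velocity field alone. That mirrors the idea of filling in missing geometry while keeping already generated regions fixed.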