Constructing a 3D Town from a Single Image

Abstract

Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. A lightweight alternative, generating complex 3D scenes from a single top-down image, is therefore valuable in real-world applications. Although recent 3D generative models achieve remarkable results at the object level, extending them to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework that synthesizes realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation, which improves image-to-3D alignment and resolution, and spatial-aware 3D inpainting, which ensures global scene coherence and high-quality geometry. Specifically, we decompose the input image into overlapping regions and generate each region with a pretrained 3D object generator, then apply a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design lets us overcome resolution bottlenecks and preserve spatial structure without 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation from a single image is achievable with a principled, training-free approach.
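
The two steps named in the abstract, overlapping-region decomposition and masked rectified flow inpainting, can be illustrated with a toy sketch. This is not the 3DTown implementation: the function names (`overlapping_regions`, `masked_flow_inpaint`, `toy_velocity`), the `tile` and `stride` parameters, the flat list used as a latent, and the closed-form velocity field are all assumptions made for illustration. The sketch also assumes the convention that t = 1 is pure noise and t = 0 is data, and it re-imposes known entries after each Euler step, which is one common way to do masked inpainting with a flow model and may differ from the paper's procedure.

```python
import random


def overlapping_regions(width, height, tile, stride):
    """Return (x0, y0, x1, y1) boxes that cover the image with overlap.

    Hypothetical helper: neighbouring boxes overlap when stride < tile.
    """
    if not 0 < stride <= tile:
        raise ValueError("stride must be in (0, tile]")

    def starts(extent):
        if extent <= tile:
            return [0]
        s = list(range(0, extent - tile, stride))
        s.append(extent - tile)  # last box sits flush with the border
        return s

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def toy_velocity(target):
    """Stand-in for a pretrained velocity network (assumption).

    For x_t = (1 - t) * x0 + t * noise with a single data point x0 = target,
    the exact velocity noise - x0 equals (x_t - target) / t.
    """
    def velocity(x, t):
        return [(xi - ti) / t for xi, ti in zip(x, target)]
    return velocity


def masked_flow_inpaint(known, mask, velocity, steps=20, seed=0):
    """Generate entries where mask == 1; keep `known` where mask == 0."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in known]
    x = list(noise)  # start at t = 1 (pure noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = (steps - k) / steps
        t_next = (steps - k - 1) / steps
        v = velocity(x, t)
        # Euler step from t towards t_next (noise -> data direction).
        x = [xi - dt * vi for xi, vi in zip(x, v)]
        # Re-impose known entries, noised to the current time level.
        x = [xi if m else (1.0 - t_next) * ki + t_next * ni
             for xi, m, ki, ni in zip(x, mask, known, noise)]
    return x


if __name__ == "__main__":
    print(overlapping_regions(10, 4, tile=4, stride=3))
    target = [1.0, 2.0, 3.0, 4.0]
    print(masked_flow_inpaint([1.0, 2.0, 0.0, 0.0], [0, 0, 1, 1],
                              toy_velocity(target), steps=10))
```

In a real pipeline the velocity would come from the pretrained generator and would couple masked and known entries, so the known geometry of neighbouring regions could steer what is filled in; the toy field above treats each entry independently and only demonstrates the control flow.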