Constructing a 3D Town from a Single Image

Abstract

Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
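The two steps named in the abstract, overlapping region decomposition and masked rectified flow inpainting, can be illustrated with a minimal NumPy sketch. This is not the 3DTown implementation: the function names, the `tile` and `overlap` parameters, the 2D array standing in for a 3D latent, the time convention (t = 1 is pure noise, t = 0 is clean data), and the `velocity_fn` placeholder for a pretrained generator are all assumptions made for illustration.

```python
import numpy as np


def split_into_overlapping_regions(image, tile, overlap):
    """Cut `image` into tile x tile crops whose neighbours share at least
    `overlap` pixels. Returns a list of (y0, x0, crop) tuples.
    Hypothetical helper; the abstract gives no tile size or overlap."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    height, width = image.shape[:2]
    stride = tile - overlap

    def starts(length):
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # last tile sits flush with the border
        return positions

    return [(y, x, image[y:y + tile, x:x + tile])
            for y in starts(height) for x in starts(width)]


def masked_rectified_flow_inpaint(known, keep_mask, velocity_fn, steps=20, seed=0):
    """Euler integration of a rectified flow from t = 1 to t = 0.
    After every step, cells where `keep_mask` is True are overwritten with
    the known content noised to the current time, so only the remaining
    cells are actually generated. Assumed scheme, not the paper's exact one."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(known.shape)
    x = noise.copy()  # start from pure noise at t = 1
    times = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(times[:-1], times[1:]):
        x = x - (t - t_next) * velocity_fn(x, t)  # one Euler step towards t = 0
        known_t = (1.0 - t_next) * known + t_next * noise
        x = np.where(keep_mask, known_t, x)  # re-impose the known region
    return x


if __name__ == "__main__":
    image = np.arange(100, dtype=float).reshape(10, 10)
    regions = split_into_overlapping_regions(image, tile=4, overlap=2)
    print(len(regions), "regions")

    # Toy velocity field that transports every cell towards the value 1.0;
    # a real system would query a pretrained 3D generator here.
    target = np.ones((4, 4))
    known = np.zeros((4, 4))
    keep = np.zeros((4, 4), dtype=bool)
    keep[:, :2] = True  # left half is treated as already generated
    result = masked_rectified_flow_inpaint(
        known, keep, lambda x, t: (x - target) / t)
    print(result)
```

In this toy setup the kept half of the array retains its known values while the other half is filled in by the flow, which mirrors the idea of completing missing geometry without disturbing regions that are already in place.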
