HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

Abstract

Scene-level 3D generation is a critical frontier in multimedia and computer graphics, yet existing approaches either support only a limited set of object categories or lack the editing flexibility that interactive applications require. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation, delivering high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is to treat a scene as a hierarchical "object" under an isometric view: a room functions as a complex object that can be further decomposed into manipulable items. This hierarchical approach lets us generate 3D content that aligns with its 2D representation while preserving compositional structure. To ensure that each decomposed instance is complete and spatially aligned, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and we introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.