HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

Abstract

Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.
