FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
March 20, 2026
Authors: Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang
cs.AI
Abstract
Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and improve holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.
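The rectified flow backbone named in the abstract transports samples along (near-)straight paths x_t = (1 − t)·x0 + t·x1 by learning a velocity field and integrating it with an ODE solver. The toy sketch below is a generic illustration of that idea, not FlowScene's tri-branch model: it uses a single target "data" point so the ideal velocity field has a closed form and no network training is needed; all names are illustrative.

```python
import numpy as np

# Toy rectified flow: transport noise x0 ~ N(0, 1) to a single data
# point X1 along the straight-line path x_t = (1 - t)*x0 + t*X1.
# With one data point, the ideal (marginal) velocity field is known
# in closed form, v(x, t) = (X1 - x) / (1 - t); a real model would
# instead train a network to regress the velocity x1 - x0.
X1 = 2.0  # illustrative target, stands in for the data distribution

def velocity(x, t):
    """Closed-form rectified-flow velocity toward X1 (delta data dist)."""
    return (X1 - x) / (1.0 - t)

def sample(x0, steps=100):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to t = 1."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so no division by zero
        x = x + dt * velocity(x, t)
    return x

rng = np.random.default_rng(0)
print(sample(rng.standard_normal()))  # converges to X1 = 2.0
```

Because the paths here are exactly straight, even coarse Euler integration lands on the target; in practice rectified flow's appeal is precisely that straightened trajectories allow few-step sampling with small discretization error.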