FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Abstract

Scene generation has broad industrial applications and demands both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from large-scale object databases, but they overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer finer control over objects and promote holistic consistency by modeling relations explicitly, yet existing methods struggle to produce high-fidelity textured results, which limits their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This design provides fine-grained control over objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in generation realism, style consistency, and alignment with human preferences.