
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

August 1, 2025
作者: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
cs.AI

Abstract

In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity preservation, background preservation, layout control, and prompt following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
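The abstract does not give GIA's exact formulation, but the described behavior (isolating attention between reference groups so that entities stay disentangled, while all groups interact with the generated image) can be sketched as a block-structured attention mask. The function name, the group-id labeling, and the convention that group 0 denotes the generated-image tokens are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def group_isolation_mask(group_ids):
    """Sketch of a group-isolation attention mask.

    Each token carries a group id; tokens of different reference groups
    are blocked from attending to each other, while every token may
    interact with the generated-image tokens (group 0, by assumption).
    Returns a boolean matrix: True = attention allowed.
    """
    ids = np.asarray(group_ids)
    # Tokens within the same group attend to each other freely.
    same_group = ids[:, None] == ids[None, :]
    # Any pair involving a generated-image token (group 0) is allowed,
    # so each reference still conditions the composed output.
    involves_target = (ids[:, None] == 0) | (ids[None, :] == 0)
    return same_group | involves_target

# Toy sequence: two generated-image tokens (0), then two tokens of
# reference A (1) and one token of reference B (2).
mask = group_isolation_mask([0, 0, 1, 1, 2])
```

In an MMDiT-style block, such a mask would be applied inside scaled dot-product attention (e.g. by setting disallowed logits to negative infinity before the softmax); being purely a mask, it plugs into a pretrained model without any training, which matches the training-free claim.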