
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

August 1, 2025
作者: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
cs.AI

Abstract

In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity preservation, background preservation, layout control, and prompt following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
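The abstract does not give GIA's exact formulation, but the described behavior (isolating attention between reference groups so that entities stay disentangled, while all groups interact with the generated image) can be sketched as a block-structured attention mask. The function name, the group-id labeling, and the convention that group 0 denotes the generated-image tokens are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def group_isolation_mask(group_ids):
    """Sketch of a group-isolation attention mask.

    Each token carries a group id; tokens of different reference groups
    are blocked from attending to each other, while every token may
    interact with the generated-image tokens (group 0, by assumption).
    Returns a boolean matrix: True = attention allowed.
    """
    ids = np.asarray(group_ids)
    # Tokens within the same group attend to each other freely.
    same_group = ids[:, None] == ids[None, :]
    # Any pair involving a generated-image token (group 0) is allowed,
    # so each reference still conditions the composed output.
    involves_target = (ids[:, None] == 0) | (ids[None, :] == 0)
    return same_group | involves_target

# Toy sequence: two generated-image tokens (0), then two tokens of
# reference A (1) and one token of reference B (2).
mask = group_isolation_mask([0, 0, 1, 1, 2])
```

In an MMDiT-style block, such a mask would be applied inside scaled dot-product attention (e.g. by setting disallowed logits to negative infinity before the softmax); being purely a mask, it plugs into a pretrained model without any training, which matches the training-free claim.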