LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
August 1, 2025
Authors: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
cs.AI
Abstract
In controllable image synthesis, generating coherent and consistent images
from multiple references with spatial layout awareness remains an open
challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework
that, for the first time, extends single-reference diffusion models to
multi-reference scenarios in a training-free manner. Built upon the MMDiT
model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group
Isolation Attention (GIA) to enhance entity disentanglement; and 2)
Region-Modulated Attention (RMA) to enable layout-aware generation. To
comprehensively evaluate model capabilities, we further introduce three
metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout
control; and 2) Background Similarity (BG-S) for measuring background
consistency. Extensive experiments show that LAMIC achieves state-of-the-art
performance across most major metrics: it consistently outperforms existing
multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all
settings, and achieves the best DPG in complex composition tasks. These results
demonstrate LAMIC's superior identity preservation, background preservation,
layout control, and prompt following, all achieved without any training or
fine-tuning, and showcase strong zero-shot generalization. By
inheriting the strengths of advanced single-reference models and enabling
seamless extension to multi-image scenarios, LAMIC establishes a new
training-free paradigm for controllable multi-image composition. As foundation
models continue to evolve, LAMIC's performance is expected to scale
accordingly. Our implementation is available at:
https://github.com/Suchenl/LAMIC.
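
To give a rough sense of what group-isolated and region-modulated attention can look like in practice, the sketch below builds a boolean mask that confines attention within per-entity token groups and adds a bias that steers attention toward keys inside a target layout region. The function names, the token-range layout, the additive-bias formulation, and the `strength` hyperparameter are illustrative assumptions, not LAMIC's actual GIA/RMA implementation inside MMDiT attention blocks.

```python
import torch


def build_group_isolation_mask(group_ranges, num_tokens):
    """Boolean attention mask that keeps each token group self-contained.

    group_ranges: list of (start, end) index ranges into the flattened token
        sequence, one range per reference entity (hypothetical layout).
    Tokens may attend only within their own group; cross-group attention is
    blocked. Illustrative sketch only, not LAMIC's actual code.
    """
    mask = torch.zeros(num_tokens, num_tokens, dtype=torch.bool)
    for start, end in group_ranges:
        mask[start:end, start:end] = True
    return mask  # True = attention allowed


def region_modulated_logits(attn_logits, region_mask, strength=3.0):
    """Bias attention logits toward key tokens inside a target region.

    attn_logits: (num_heads, num_queries, num_keys) raw attention scores.
    region_mask: (num_keys,) boolean tensor marking key tokens whose latent
        positions fall inside the entity's target bounding box.
    strength: magnitude of the additive bias (assumed hyperparameter).
    """
    bias = region_mask.to(attn_logits.dtype) * (2.0 * strength) - strength
    return attn_logits + bias  # broadcasts over heads and queries
```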
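
The abstract does not spell out how Inclusion Ratio (IN-R) and Fill Ratio (FI-R) are computed. One plausible reading is that IN-R measures how much of a generated entity lies inside its target layout box, while FI-R measures how much of that box the entity covers. The sketch below implements that reading from a segmentation mask of the generated entity and a pixel-space box; treat the exact formulas as assumptions rather than the paper's definitions.

```python
import numpy as np


def inclusion_and_fill_ratio(entity_mask, target_box):
    """Compute plausible IN-R and FI-R values for one entity.

    entity_mask: (H, W) boolean mask of the generated entity, e.g. obtained
        by running a segmentation model on the composed image.
    target_box: (x0, y0, x1, y1) target layout box in pixel coordinates.
    Returns (in_r, fi_r): fraction of the entity inside the box, and fraction
    of the box covered by the entity. Assumed definitions for illustration.
    """
    x0, y0, x1, y1 = target_box
    box_mask = np.zeros_like(entity_mask, dtype=bool)
    box_mask[y0:y1, x0:x1] = True

    entity_area = entity_mask.sum()
    box_area = box_mask.sum()
    overlap = (entity_mask & box_mask).sum()

    in_r = overlap / entity_area if entity_area > 0 else 0.0
    fi_r = overlap / box_area if box_area > 0 else 0.0
    return in_r, fi_r
```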