LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Abstract

In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA), which enhances entity disentanglement; and 2) Region-Modulated Attention (RMA), which enables layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance on most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's strength in identity preservation, background preservation, layout control, and prompt following, all achieved without any training or fine-tuning, which indicates strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and extending them seamlessly to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
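To illustrate the layout metrics named above, the sketch below shows one plausible reading of IN-R and FI-R as box-overlap ratios. This is an assumption, not the definition used by LAMIC: the abstract does not give the formulas. The function names, the (x_min, y_min, x_max, y_max) box format, and the use of axis-aligned boxes are all hypothetical choices made for this example.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def inclusion_ratio(detected: Box, target: Box) -> float:
    """Assumed IN-R: fraction of the detected entity box that lies inside the target region."""
    area = box_area(detected)
    return intersection_area(detected, target) / area if area > 0 else 0.0


def fill_ratio(detected: Box, target: Box) -> float:
    """Assumed FI-R: fraction of the target region covered by the detected entity box."""
    area = box_area(target)
    return intersection_area(detected, target) / area if area > 0 else 0.0
```

Under this reading the two ratios are complementary. A small entity placed fully inside a large target region scores a high inclusion ratio but a low fill ratio, while an entity that spills over the region boundary scores the reverse, so reporting both gives a fuller picture of layout control than either alone.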
