LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Abstract

In controllable image synthesis, generating coherent and consistent images from multiple reference images with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built on the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA), which enhances entity disentanglement; and 2) Region-Modulated Attention (RMA), which enables layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three new metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance on most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior identity preservation, background preservation, layout control, and prompt following, all achieved without any training or fine-tuning, and showcase strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and extending them seamlessly to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
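The abstract names the layout metrics IN-R and FI-R without defining them. As a rough illustration only, the sketch below assumes one plausible reading: Inclusion Ratio as the share of a detected entity's box that falls inside its target layout region, and Fill Ratio as the share of the target region that the detected entity covers. The `Box` type, the function names, and the use of axis-aligned boxes are all hypothetical and are not taken from the paper or the LAMIC repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box (x0, y0) to (x1, y1); a hypothetical helper type."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted boxes are treated as having zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0.0 if they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, w) * max(0.0, h)


def inclusion_ratio(detected: Box, target: Box) -> float:
    """ASSUMED definition: fraction of the detected box inside the target region."""
    area = detected.area()
    return intersection_area(detected, target) / area if area > 0 else 0.0


def fill_ratio(detected: Box, target: Box) -> float:
    """ASSUMED definition: fraction of the target region covered by the detected box."""
    area = target.area()
    return intersection_area(detected, target) / area if area > 0 else 0.0


if __name__ == "__main__":
    detected = Box(0, 0, 2, 2)   # where the entity was found in the output
    target = Box(1, 0, 5, 2)     # where the layout asked for it
    print(inclusion_ratio(detected, target), fill_ratio(detected, target))
```

Under this reading the two ratios complement each other: a tiny entity placed well inside its region scores high on inclusion but low on fill, while an oversized entity that spills past the region scores high on fill but low on inclusion.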
