Subject-Driven Generation by Maximizing the Capabilities of Multimodal Large Language Models

Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of a given subject while following textual instructions. Existing approaches often encode text and reference images separately, which limits cross-modal reasoning and causes copy-paste artifacts. Recent frameworks that connect multimodal models with diffusion models improve instruction following but largely overlook identity preservation. To address these limitations, we condition a diffusion model on a Multimodal Large Language Model (MLLM) that jointly encodes text and reference images, and we augment this conditioning with VAE-based identity conditioning. We design a novel Dual Layer Aggregation (DLA) module that aggregates multi-level MLLM features for optimal conditioning, and we apply a multi-stage denoising strategy that progressively balances semantic information from the MLLM against fine-detail identity from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior human-preference performance on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.
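
To make the two conditioning ideas above concrete, the following is a minimal illustrative sketch, not the method's actual implementation. The abstract does not specify how multi-level features are combined or how the denoising stages are scheduled, so everything here is an assumption: the softmax-weighted sum, the function names `aggregate_layers` and `conditioning_weights`, the stage boundaries, and in particular the ordering (semantic conditioning early, identity detail late) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Combine per-layer feature vectors into one conditioning vector.

    Hypothetical stand-in for multi-level feature aggregation: a weighted
    sum whose weights are the softmax of one score per layer. This is an
    assumption for illustration, not the DLA module itself.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, layer_features))
        for i in range(dim)
    ]


def conditioning_weights(step, num_steps, stage_boundaries=(0.4, 0.8)):
    """Return (mllm_weight, vae_weight) for a given denoising step.

    Assumed three-stage schedule: MLLM semantics only, then an even mix,
    then VAE identity detail only. Boundaries and ordering are hypothetical.
    """
    progress = step / max(num_steps - 1, 1)  # 0.0 at first step, 1.0 at last
    if progress < stage_boundaries[0]:
        return (1.0, 0.0)
    if progress < stage_boundaries[1]:
        return (0.5, 0.5)
    return (0.0, 1.0)


if __name__ == "__main__":
    # Two toy "layers" with equal scores are averaged element-wise.
    print(aggregate_layers([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
    # Print the assumed conditioning mix at each of 10 denoising steps.
    for step in range(10):
        print(step, conditioning_weights(step, 10))
```

In a real system the layer scores would be learned and the features would be tensors rather than Python lists; the sketch only shows the shape of the computation under the stated assumptions.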