Extracting Capabilities from Multimodal Large Language Models for Subject-Driven Generation

Abstract

Subject-driven image generation aims to synthesize new images that follow textual instructions while preserving the identity of a given subject. Existing approaches often encode the text and the reference images separately, which limits cross-modal reasoning and causes copy-paste artifacts. Recent frameworks that connect multimodal models to diffusion models improve instruction following but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and we augment this conditioning with VAE-based identity conditioning. We design a novel Dual Layer Aggregation (DLA) module that aggregates multi-level MLLM features for conditioning, and we apply a multi-stage denoising strategy that progressively balances the semantic information from the MLLM against the fine-detail identity information from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance in human preference evaluations on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.
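To make the idea of stage-wise balancing concrete, the following is a minimal illustrative sketch only. The abstract does not specify the schedule, so everything here is an assumption: the function names `stage_weights` and `blend`, the three-stage split, the boundary fractions, and the choice to favor the MLLM-conditioned signal early and the VAE-conditioned signal late are all hypothetical and are not taken from the method itself.

```python
def stage_weights(step, num_steps, boundaries=(0.3, 0.7)):
    """Return (mllm_weight, vae_weight) for one denoising step.

    Hypothetical three-stage schedule (an assumption, not the paper's):
    early steps rely on MLLM semantics, middle steps mix both signals,
    late steps rely on VAE identity detail.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must be in [0, num_steps)")
    progress = step / num_steps  # fraction of the denoising run completed
    early_end, middle_end = boundaries
    if progress < early_end:
        return 1.0, 0.0  # stage 1: semantic signal only
    if progress < middle_end:
        return 0.5, 0.5  # stage 2: equal mix of both signals
    return 0.0, 1.0      # stage 3: identity-detail signal only


def blend(mllm_pred, vae_pred, step, num_steps):
    """Combine two per-element predictions using the stage weights."""
    w_mllm, w_vae = stage_weights(step, num_steps)
    return [w_mllm * m + w_vae * v for m, v in zip(mllm_pred, vae_pred)]
```

In a real pipeline the two predictions would be tensors produced by the diffusion model under each conditioning signal; plain Python lists are used here only to keep the sketch self-contained.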