Mining Capabilities from Multimodal Large Language Models for Subject-Driven Generation

Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of a given subject while following textual instructions. Existing approaches often encode text and reference images separately, which limits cross-modal reasoning and causes copy-paste artifacts. Recent frameworks that connect multimodal models with diffusion models improve instruction following but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment this conditioning with VAE-based identity conditioning. We design a novel Dual Layer Aggregation (DLA) module that aggregates multi-level MLLM features for optimal conditioning, and apply a multi-stage denoising strategy that progressively balances semantic information from the MLLM with fine-grained identity details from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior human-preference results on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.
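To make the two conditioning ideas concrete, the following is a minimal toy sketch, not the method itself. The abstract does not specify how DLA combines layers or how the denoising stages are scheduled, so everything here is an assumption made for illustration: the function names `aggregate_layers` and `conditioning_weights`, the softmax-weighted sum over layers, the three-stage split with boundaries at 30% and 70% of the steps, and the choice to favor MLLM semantics early and VAE identity late.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Combine per-layer feature vectors with a softmax-weighted sum.

    Hypothetical stand-in for multi-level feature aggregation; the real
    DLA design is not described in the abstract.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, layer_features))
        for i in range(dim)
    ]


def conditioning_weights(step, num_steps, boundaries=(0.3, 0.7)):
    """Return (mllm_weight, vae_weight) for one denoising step.

    Assumed three-stage schedule: semantic conditioning only, then an
    even mix, then identity conditioning only. Both the boundaries and
    the ordering are illustrative assumptions.
    """
    progress = step / max(num_steps - 1, 1)
    if progress < boundaries[0]:
        return 1.0, 0.0
    if progress < boundaries[1]:
        return 0.5, 0.5
    return 0.0, 1.0


if __name__ == "__main__":
    # Two toy "layers", each a 2-dimensional feature vector.
    fused = aggregate_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
    print("fused features:", fused)
    for step in range(10):
        print(step, conditioning_weights(step, num_steps=10))
```

In a real pipeline the per-layer scores would presumably be learned and the features would be tensors taken from MLLM hidden states; the piecewise schedule could equally be replaced by a smooth interpolation between the two conditioning signals.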