

MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning

September 26, 2025
Authors: Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
cs.AI

Abstract

Multi-subject image generation aims to synthesize multiple user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context Learning paradigm, are limited by their reliance on simple reconstruction objectives, which leads both to severe attribute leakage that compromises subject fidelity and to a failure to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework for high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is significant entanglement of attention between different subjects during generation. We therefore introduce explicit positional supervision that separates the attention region of each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention regions of different subjects across diverse scenarios, we employ a Mixture-of-Experts (MoE) architecture that increases model capacity and lets different experts specialize in different scenarios. Finally, we design a novel online reinforcement learning framework that aligns the model with human preferences, featuring a scoring mechanism that accurately assesses multi-subject fidelity and a more stable training strategy tailored to the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
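The spatially disentangled attention idea can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, shapes, and the hard region mask are illustrative assumptions. The point is that when each image token may attend only within its own subject's region, one subject's features cannot mix into another's, which is the mechanism the abstract credits with mitigating attribute leakage.

```python
import numpy as np

def masked_attention(q, k, v, region_mask):
    """Single-head attention where each query position may only attend to
    key positions permitted by region_mask (True = allowed).

    q, k, v: (N, d) arrays of token features; region_mask: (N, N) boolean.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (N, N) similarity scores
    scores = np.where(region_mask, scores, -1e9)   # block cross-subject attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 image tokens; the first two belong to subject A, the last two to B.
rng = np.random.default_rng(0)
N, d = 4, 8
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
subject = np.array([0, 0, 1, 1])                   # hypothetical subject assignment
mask = subject[:, None] == subject[None, :]        # attend only within one's own region
out = masked_attention(q, k, v, mask)
print(out.shape)
```

With the mask in place, perturbing subject B's value vectors leaves subject A's outputs unchanged, i.e., no attribute leakage across regions in this toy setting.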