

MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning

September 26, 2025
Authors: Tao Wu, Yibo Jiang, Yehao Lu, Zhizhong Wang, Zeyi Huang, Zequn Qin, Xi Li
cs.AI

Abstract

Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, leading both to severe attribute leakage that compromises subject fidelity and to misalignment with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is significant entanglement of attention between different subjects during the generation process. We therefore introduce explicit positional supervision to separate the attention regions of each subject, effectively mitigating attribute leakage. To enable the model to accurately plan the attention regions of different subjects across diverse scenarios, we employ a Mixture-of-Experts architecture to enhance the model's capacity, allowing different experts to specialize in different scenarios. Finally, we design a novel online reinforcement learning framework that aligns the model with human preferences, featuring a scoring mechanism that accurately assesses multi-subject fidelity and a more stable training strategy tailored to the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while achieving better alignment with human preferences.
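The core idea of separating each subject's attention region can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the hard per-subject token-range assignment, and the tensor shapes are all illustrative assumptions. It simply shows how an additive attention mask can force image tokens inside one subject's spatial region to attend only to that subject's reference tokens, which is the disentanglement property the positional supervision is said to encourage.

```python
import numpy as np

def disentangled_attention_mask(num_img_tokens, subject_regions, tokens_per_subject):
    """Build an additive mask so that image tokens inside each subject's
    spatial region attend only to that subject's reference tokens.

    subject_regions: list of (lo, hi) image-token index ranges, one per
    subject (a hypothetical hard assignment; real models would learn or
    plan these regions). Returns a (num_img_tokens, num_ref_tokens) mask
    of 0.0 (allowed) and -inf (blocked).
    """
    total_ref = len(subject_regions) * tokens_per_subject
    mask = np.full((num_img_tokens, total_ref), -np.inf)
    for s, (lo, hi) in enumerate(subject_regions):
        r0 = s * tokens_per_subject
        mask[lo:hi, r0:r0 + tokens_per_subject] = 0.0  # unblock subject s only
    return mask

def masked_attention(q, k, v, mask):
    """Standard scaled dot-product attention with an additive mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)                            # exp(-inf) -> 0, blocking leakage
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Toy usage: 8 image tokens, two subjects occupying tokens 0-3 and 4-7,
# each subject contributing 2 reference tokens with distinctive values.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((4, 4))
v = np.array([[1.0, 0.0], [1.0, 0.0],   # subject 0's reference values
              [0.0, 1.0], [0.0, 1.0]])  # subject 1's reference values
mask = disentangled_attention_mask(8, [(0, 4), (4, 8)], tokens_per_subject=2)
out = masked_attention(q, k, v, mask)
# Tokens in region 0 mix only subject-0 values; region 1 only subject-1 values.
```

Because each masked row is a convex combination over a single subject's reference tokens, no attribute from the other subject can bleed into that region, which is the failure mode the paper attributes to entangled attention.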