MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning

Abstract

Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, rely on simple reconstruction-based objectives. As a result, they suffer from severe attribute leakage that compromises subject fidelity, and they fail to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is significant entanglement of the attention regions of different subjects during the generation process. We therefore introduce explicit positional supervision that separates the attention region of each subject, effectively mitigating attribute leakage. Second, to enable the model to accurately plan the attention regions of different subjects in diverse scenarios, we employ a Mixture-of-Experts (MoE) architecture that enhances the model's capacity and allows different experts to focus on different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism that accurately assesses multi-subject fidelity and a more stable training strategy tailored to the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
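To make the idea of separating per-subject attention regions concrete, the following is a minimal illustrative sketch, not the method described in the paper. The abstract does not state the form of the positional supervision, so the penalty below is an assumption: it measures the fraction of each subject's attention mass that falls outside that subject's target region. The function name `leakage_loss`, the flattened list representation, and the toy values are all hypothetical.

```python
def leakage_loss(attn_maps, masks, eps=1e-8):
    """Mean fraction of each subject's attention mass outside its own region.

    attn_maps: one flattened attention map per subject (non-negative floats).
    masks: one flattened binary mask per subject (1 inside the target region).
    This penalty is an assumed stand-in for positional supervision.
    """
    if not attn_maps or len(attn_maps) != len(masks):
        raise ValueError("need a non-empty list with one mask per attention map")
    total = 0.0
    for attn, mask in zip(attn_maps, masks):
        if len(attn) != len(mask):
            raise ValueError("attention map and mask must have the same length")
        mass = sum(attn)
        # Attention that lands where the mask is 0 counts as leakage.
        outside = sum(a * (1 - m) for a, m in zip(attn, mask))
        total += outside / (mass + eps)
    return total / len(attn_maps)


# Toy example: two subjects on a flattened grid of four positions.
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
entangled = [[0.4, 0.1, 0.4, 0.1], [0.3, 0.2, 0.3, 0.2]]
separated = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.6, 0.4]]

loss_entangled = leakage_loss(entangled, masks)
loss_separated = leakage_loss(separated, masks)
```

In this toy setting the entangled maps place half of each subject's attention in the other subject's region, so the penalty is close to 0.5, while the separated maps incur no penalty. A real training objective would operate on the model's attention tensors and would likely be combined with the reconstruction loss.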