MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning

Abstract

Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, rely on simple reconstruction-based objectives. As a result, they suffer from severe attribute leakage that compromises subject fidelity, and they fail to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework for high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is significant entanglement of attention between different subjects during generation. We therefore introduce explicit positional supervision that separates the attention regions of the subjects, effectively mitigating attribute leakage. Second, to enable the model to accurately plan the attention regions of different subjects across diverse scenarios, we employ a Mixture-of-Experts (MoE) architecture that increases model capacity and lets different experts specialize in different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism that accurately assesses multi-subject fidelity and a more stable training strategy tailored to the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
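
To make the idea of positional supervision over attention regions concrete, the following is a minimal sketch and not the formulation used by MultiCrafter, which the abstract does not specify. It assumes, purely for illustration, that each subject has a 2D map of attention mass over image locations and a binary target region, and it measures how much attention falls outside that region. The names `leakage_fraction` and `positional_supervision_loss`, and the loss form itself, are hypothetical.

```python
def leakage_fraction(attn, mask):
    """Fraction of one subject's attention mass that falls outside its region.

    attn: 2D list of non-negative floats (H x W); hypothetical attention mass
          that each image location assigns to this subject's reference tokens.
    mask: 2D list of 0/1 values (H x W); 1 where the subject should appear.
    """
    total = 0.0
    outside = 0.0
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            total += a
            outside += a * (1 - m)  # attention mass landing outside the region
    return outside / total if total > 0 else 0.0


def positional_supervision_loss(attns, masks):
    """Mean leakage fraction over all subjects (0 means fully separated).

    This is an assumed, illustrative loss, not the paper's objective.
    """
    return sum(leakage_fraction(a, m) for a, m in zip(attns, masks)) / len(attns)


# Two subjects on a 2x4 grid: subject 0 on the left half, subject 1 on the right.
left = [[1, 1, 0, 0], [1, 1, 0, 0]]
right = [[0, 0, 1, 1], [0, 0, 1, 1]]
uniform = [[1.0] * 4 for _ in range(2)]  # attention spread over both regions

separated_loss = positional_supervision_loss([left, right], [left, right])
entangled_loss = positional_supervision_loss([uniform, uniform], [left, right])
print(separated_loss, entangled_loss)
```

In this toy setting, attention confined to each subject's own region gives a loss of 0, while attention spread uniformly over both regions gives 0.5. A training objective of this kind would push the model toward the separated case.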