MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement

Abstract

Multi-subject personalized generation presents unique challenges in maintaining identity fidelity and semantic coherence when synthesizing images conditioned on multiple reference subjects. Existing methods often suffer from identity blending and attribute leakage because they do not adequately model how different subjects should interact within shared representation spaces. We present MOSAIC, a representation-centric framework that rethinks multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement. Our key insight is that multi-subject generation requires precise semantic alignment at the representation level: the model must know exactly which regions in the generated image should attend to which parts of each reference. To enable this, we introduce SemAlign-MS, a meticulously annotated dataset that provides fine-grained semantic correspondences between multiple reference subjects and target images, a resource previously unavailable in this domain. Building on this foundation, we propose the semantic correspondence attention loss, which enforces precise point-to-point semantic alignment and ensures high consistency from each reference to its designated regions. Furthermore, we develop the multi-reference disentanglement loss, which pushes different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity characteristics. Extensive experiments demonstrate that MOSAIC achieves state-of-the-art performance on multiple benchmarks. Notably, while existing methods typically degrade beyond three subjects, MOSAIC maintains high fidelity with four or more reference subjects, opening new possibilities for complex multi-subject synthesis applications.
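
The abstract does not give formulas for the two losses, so the following is only a rough sketch of what they might look like, not the authors' formulation. It assumes that the correspondence loss can be written as a cross-entropy over annotated (target token, reference token) pairs, and that "orthogonal attention subspaces" can be approximated by penalizing the pairwise cosine similarity between per-reference attention maps. The function names, array shapes, and the `eps` parameter are all hypothetical.

```python
import numpy as np


def correspondence_attention_loss(attn, pairs, eps=1e-8):
    """Illustrative stand-in for a semantic correspondence attention loss.

    attn  : (T, R) array; row t is the attention distribution of target
            token t over R reference tokens (rows assumed to sum to 1).
    pairs : list of (target_idx, ref_idx) annotated correspondences.

    Returns the mean negative log-attention at the annotated pairs, so the
    loss is small when each target token attends to its matched reference token.
    """
    t_idx = np.array([p[0] for p in pairs])
    r_idx = np.array([p[1] for p in pairs])
    return float(-np.log(attn[t_idx, r_idx] + eps).mean())


def disentanglement_loss(attn_maps, eps=1e-8):
    """Illustrative stand-in for a multi-reference disentanglement loss.

    attn_maps : sequence of (T,) non-negative arrays, one per reference
                subject, giving how strongly each target token attends
                to that subject.

    Returns the mean pairwise cosine similarity between the maps: 0 when
    the maps are mutually orthogonal, close to 1 when they coincide.
    """
    maps = np.stack(attn_maps)                      # (S, T)
    num_subjects = maps.shape[0]
    if num_subjects < 2:
        return 0.0                                  # nothing to disentangle
    unit = maps / (np.linalg.norm(maps, axis=1, keepdims=True) + eps)
    gram = unit @ unit.T                            # (S, S) cosine similarities
    off_diagonal = gram[~np.eye(num_subjects, dtype=bool)]
    return float(off_diagonal.mean())
```

In a training setup along these lines, both terms would presumably be added with some weighting to the usual generative objective. The weights and the attention layers they apply to are not stated in the abstract.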