WithAnyone: Towards Controllable and ID Consistent Image Generation

Abstract

Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct MultiID-2M, a large-scale paired dataset tailored for multi-person scenarios that provides diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive, controllable generation.
