WithAnyone: Towards Controllable and ID Consistent Image Generation
October 16, 2025
Authors: Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
cs.AI
Abstract
Identity-consistent generation has become an important focus in text-to-image
research, with recent models achieving notable success in producing images
aligned with a reference identity. Yet, the scarcity of large-scale paired
datasets containing multiple images of the same individual forces most
approaches to adopt reconstruction-based training. This reliance often leads to
a failure mode we term copy-paste, where the model directly replicates the
reference face rather than preserving identity across natural variations in
pose, expression, or lighting. Such over-similarity undermines controllability
and limits the expressive power of generation. To address these limitations, we
(1) construct a large-scale paired dataset MultiID-2M, tailored for
multi-person scenarios, providing diverse references for each identity; (2)
introduce a benchmark that quantifies both copy-paste artifacts and the
trade-off between identity fidelity and variation; and (3) propose a novel
training paradigm with a contrastive identity loss that leverages paired data
to balance fidelity with diversity. These contributions culminate in
WithAnyone, a diffusion-based model that effectively mitigates copy-paste while
preserving high identity similarity. Extensive qualitative and quantitative
experiments demonstrate that WithAnyone significantly reduces copy-paste
artifacts, improves controllability over pose and expression, and maintains
strong perceptual quality. User studies further validate that our method
achieves high identity fidelity while enabling expressive controllable
generation.
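
The abstract names a contrastive identity loss over paired data but does not spell it out. As a rough illustration of the idea only, the sketch below implements a common InfoNCE-style formulation over face embeddings, where the positive pair is a different photo of the same identity (so the model is rewarded for matching identity, not pixels). The function name, tensor shapes, and temperature value are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def contrastive_identity_loss(gen_embeds: torch.Tensor,
                              ref_embeds: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    # gen_embeds: (B, D) face embeddings cropped from generated images.
    # ref_embeds: (B, D) embeddings of *different* photos of the same B
    #             identities (the paired references); the positive shares
    #             identity but not pixels, which discourages copy-paste.
    gen = F.normalize(gen_embeds, dim=-1)
    ref = F.normalize(ref_embeds, dim=-1)
    # Cosine-similarity logits between every generated/reference pair.
    logits = gen @ ref.t() / temperature          # (B, B)
    # Row i's positive is reference i; all other rows act as in-batch
    # negatives, pushing apart embeddings of different identities.
    targets = torch.arange(gen.size(0), device=gen.device)
    return F.cross_entropy(logits, targets)

Because the positive reference is a distinct capture of the same person, a model that literally replicates the reference face gains nothing over one that preserves identity under new pose, expression, or lighting, which is the balance between fidelity and diversity the abstract describes.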