
WithAnyone: Towards Controllable and ID Consistent Image Generation

October 16, 2025
Authors: Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
cs.AI

Abstract

Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
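
The abstract names a contrastive identity loss over paired data but does not give its form. As a minimal illustrative sketch only, assuming an InfoNCE-style objective over face-recognition embeddings, the snippet below shows one plausible shape such a loss could take; the function name `contrastive_identity_loss` and all inputs (`gen_emb`, `ref_emb`, `ids`) are hypothetical, not the paper's actual code.

```python
# Illustrative sketch only: an InfoNCE-style contrastive identity loss.
# The exact formulation is an assumption, not WithAnyone's published code.
import torch
import torch.nn.functional as F

def contrastive_identity_loss(gen_emb: torch.Tensor,
                              ref_emb: torch.Tensor,
                              ids: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """gen_emb: (B, D) face embeddings from generated images.
    ref_emb: (B, D) face embeddings from paired reference photos.
    ids:     (B,)   integer identity labels; equal labels mean the same
                    person, possibly seen in *different* reference photos.
    """
    gen = F.normalize(gen_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)

    # Cosine similarity between every generated face and every reference.
    logits = gen @ ref.t() / temperature                  # (B, B)

    # Positives are all references sharing the identity label, not just the
    # row-aligned photo, so the objective pulls toward the identity itself
    # rather than toward an exact copy of one reference image.
    pos = (ids.unsqueeze(0) == ids.unsqueeze(1)).float()  # (B, B)

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Each row has at least one positive (its own paired reference),
    # so the per-anchor average below is always well defined.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1)
    return loss.mean()
```

Treating every same-identity reference as a positive, rather than only the aligned photo, is one natural way a paired dataset such as MultiID-2M could be used to reward identity preservation without rewarding pixel-level replication of any single reference, which is consistent with the copy-paste failure mode the paper targets.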