WithAnyone: Towards Controllable and ID Consistent Image Generation
October 16, 2025
Authors: Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
cs.AI
Abstract
Identity-consistent generation has become an important focus in text-to-image
research, with recent models achieving notable success in producing images
aligned with a reference identity. Yet, the scarcity of large-scale paired
datasets containing multiple images of the same individual forces most
approaches to adopt reconstruction-based training. This reliance often leads to
a failure mode we term copy-paste, where the model directly replicates the
reference face rather than preserving identity across natural variations in
pose, expression, or lighting. Such over-similarity undermines controllability
and limits the expressive power of generation. To address these limitations, we
(1) construct a large-scale paired dataset MultiID-2M, tailored for
multi-person scenarios, providing diverse references for each identity; (2)
introduce a benchmark that quantifies both copy-paste artifacts and the
trade-off between identity fidelity and variation; and (3) propose a novel
training paradigm with a contrastive identity loss that leverages paired data
to balance fidelity with diversity. These contributions culminate in
WithAnyone, a diffusion-based model that effectively mitigates copy-paste while
preserving high identity similarity. Extensive qualitative and quantitative
experiments demonstrate that WithAnyone significantly reduces copy-paste
artifacts, improves controllability over pose and expression, and maintains
strong perceptual quality. User studies further validate that our method
achieves high identity fidelity while enabling expressive controllable
generation.
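
To make the abstract's core idea concrete, here is a minimal sketch of what a contrastive identity loss over paired references could look like. This is an illustration, not the paper's actual formulation: the function name `contrastive_identity_loss`, the embedding dimensions, the temperature `tau`, and the assumption of a frozen face-recognition encoder producing the embeddings are all hypothetical. The sketch uses a multi-positive InfoNCE objective, where every paired reference of the same identity counts as a positive.

```python
import torch
import torch.nn.functional as F

def contrastive_identity_loss(gen_emb, ref_embs, neg_embs, tau=0.07):
    """Multi-positive InfoNCE over face embeddings (illustrative sketch only).

    gen_emb:  (D,)   embedding of the face in the generated image
    ref_embs: (P, D) embeddings of paired references of the same identity
    neg_embs: (N, D) embeddings of other identities (e.g., from the batch)
    """
    # Normalize so dot products are cosine similarities.
    gen = F.normalize(gen_emb, dim=-1)
    pos = F.normalize(ref_embs, dim=-1)
    neg = F.normalize(neg_embs, dim=-1)

    pos_logits = pos @ gen / tau  # (P,) similarities to same-identity refs
    neg_logits = neg @ gen / tau  # (N,) similarities to other identities

    # The generated face should land inside the identity cluster spanned by
    # all paired references, not next to any single reference image, since
    # matching one reference exactly is what rewards copy-paste.
    num = torch.logsumexp(pos_logits, dim=0)
    den = torch.logsumexp(torch.cat([pos_logits, neg_logits]), dim=0)
    return den - num

# Hypothetical usage with random stand-ins for encoder outputs:
loss = contrastive_identity_loss(
    torch.randn(512), torch.randn(4, 512), torch.randn(16, 512))
```

The design point this sketch tries to capture is why paired data matters: with multiple references per identity as positives, the loss rewards proximity to the identity as a whole rather than pixel-level similarity to one reference, which is the over-similarity failure the abstract calls copy-paste.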