XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
June 26, 2025
Authors: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
cs.AI
Abstract
Achieving fine-grained control over subject identity and semantic attributes
(pose, style, lighting) in text-to-image generation, particularly for multiple
subjects, often undermines the editability and coherence of Diffusion
Transformers (DiTs). Many approaches introduce artifacts or suffer from
attribute entanglement. To overcome these challenges, we propose XVerse, a novel
multi-subject controlled generation model. By transforming reference images into
offsets for token-specific text-stream modulation, XVerse allows precise and
independent control of each specific subject without disrupting image latents or
features. Consequently, XVerse offers high-fidelity, editable
multi-subject image synthesis with robust control over individual subject
characteristics and semantic attributes. This advancement significantly
improves personalized and complex scene generation capabilities.
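The core mechanism, turning a reference image into modulation offsets that are applied only to the text tokens naming a given subject, can be illustrated with a minimal sketch. All dimensions, the fixed linear projection standing in for XVerse's learned offset network, and the hand-picked token mask are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 text tokens, embedding width 16.
num_tokens, dim = 8, 16
text_tokens = rng.standard_normal((num_tokens, dim))

# Base DiT modulation (AdaLN-style scale/shift from the conditioning signal).
base_scale = np.ones(dim)
base_shift = np.zeros(dim)

def reference_to_offsets(ref_feature, dim):
    """Map a reference-image feature to (delta_scale, delta_shift) offsets.
    In XVerse this would be a learned network; here it is a fixed random
    linear projection used purely for illustration."""
    w = rng.standard_normal((ref_feature.shape[0], 2 * dim)) * 0.01
    out = ref_feature @ w
    return out[:dim], out[dim:]

# A feature vector for one subject's reference image (e.g. from a vision encoder).
ref_feature = rng.standard_normal(32)
d_scale, d_shift = reference_to_offsets(ref_feature, dim)

# Token-specific application: only the tokens referring to this subject
# receive the offsets; all other tokens (and the image latents, which are
# never touched here) keep the base modulation.
subject_mask = np.zeros(num_tokens, dtype=bool)
subject_mask[2:4] = True  # suppose tokens 2-3 name the subject

scale = base_scale + np.where(subject_mask[:, None], d_scale, 0.0)
shift = base_shift + np.where(subject_mask[:, None], d_shift, 0.0)
modulated = text_tokens * scale + shift

# Tokens outside the subject's span are left exactly as they were.
assert np.allclose(modulated[~subject_mask], text_tokens[~subject_mask])
```

Because each subject contributes its own offsets through its own token mask, multiple subjects can be controlled independently without entangling their attributes, which is the independence property the abstract claims.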