XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Abstract

Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows precise and independent control of each specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
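To make the idea of token-specific text-stream modulation concrete, the following is a minimal sketch and not the XVerse implementation. It assumes the standard adaLN-style form used in DiT blocks, `norm(x) * (1 + scale) + shift`, and assumes that a reference image has already been encoded into per-token scale and shift deltas. The function names, the `offsets` layout, and all numeric values are hypothetical and chosen only for illustration; plain Python lists are used to keep the example dependency-free.

```python
from typing import Dict, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize one token's features to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def modulate_text_tokens(
    tokens: List[Vector],
    shared_scale: Vector,
    shared_shift: Vector,
    offsets: Dict[int, Dict[str, Vector]],
) -> List[Vector]:
    """Apply adaLN-style modulation, adding per-token offsets where given.

    `offsets` (hypothetical layout) maps a text-token index, e.g. the
    position of a subject word in the prompt, to {"scale": ..., "shift": ...}
    deltas assumed to come from a reference image. Tokens without an entry
    use the shared modulation unchanged.
    """
    out = []
    for i, tok in enumerate(tokens):
        scale, shift = shared_scale, shared_shift
        delta = offsets.get(i)
        if delta is not None:
            scale = [s + d for s, d in zip(shared_scale, delta["scale"])]
            shift = [b + d for b, d in zip(shared_shift, delta["shift"])]
        normed = layer_norm(tok)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
    return out


# Three text tokens with four features each (illustrative values).
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, -1.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
shared_scale = [0.1, 0.1, 0.1, 0.1]
shared_shift = [0.0, 0.0, 0.0, 0.0]

# Only token 1 (assumed to be the subject word) receives an offset.
subject_offsets = {1: {"scale": [0.5, 0.0, 0.0, 0.0], "shift": [0.0, 0.1, 0.0, 0.0]}}

baseline = modulate_text_tokens(tokens, shared_scale, shared_shift, {})
controlled = modulate_text_tokens(tokens, shared_scale, shared_shift, subject_offsets)
```

In this sketch only the token carrying an offset is modulated differently, while the other text tokens match the baseline. This mirrors, in a simplified way, the abstract's claim that control is applied per subject through the text stream rather than by altering image latents or features.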
