XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
June 26, 2025
Authors: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
cs.AI
Abstract
Achieving fine-grained control over subject identity and semantic attributes
(pose, style, lighting) in text-to-image generation, particularly for multiple
subjects, often undermines the editability and coherence of Diffusion
Transformers (DiTs). Many approaches introduce artifacts or suffer from
attribute entanglement. To overcome these challenges, we propose a novel
multi-subject controlled generation model, XVerse. By transforming reference
images into offsets for token-specific text-stream modulation, XVerse allows
precise and independent control of individual subjects without disrupting
image latents or features. Consequently, XVerse offers high-fidelity, editable
multi-subject image synthesis with robust control over individual subject
characteristics and semantic attributes. This advancement significantly
improves personalized and complex scene generation capabilities.
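The core mechanism described above — turning a reference image into offsets that perturb the text-stream modulation only at that subject's token positions — can be illustrated with a minimal sketch. This is an assumption-laden toy illustration, not the paper's implementation: the function names (`modulate`, `apply_token_offsets`), the AdaLN-style scale/shift form, and the per-token mask are all hypothetical stand-ins for how token-specific modulation offsets might compose with a DiT's base text-stream modulation.

```python
import numpy as np

def modulate(tokens, scale, shift):
    # AdaLN-style modulation used in DiT blocks:
    # elementwise scale and shift of token features.
    return tokens * (1.0 + scale) + shift

def apply_token_offsets(text_tokens, base_scale, base_shift,
                        ref_scale_offset, ref_shift_offset, subject_mask):
    """Add reference-derived modulation offsets only at the subject's
    token positions (subject_mask == 1); all other tokens keep the
    unmodified base modulation, so the rest of the prompt is untouched.

    text_tokens: (num_tokens, dim) text-stream features
    base_scale, base_shift: (dim,) base modulation parameters
    ref_scale_offset, ref_shift_offset: (dim,) offsets from the reference image
    subject_mask: (num_tokens,) 1.0 where the token belongs to the subject
    """
    mask = subject_mask[:, None]  # broadcast over the feature dimension
    scale = base_scale + mask * ref_scale_offset
    shift = base_shift + mask * ref_shift_offset
    return modulate(text_tokens, scale, shift)

# Toy example: 4 prompt tokens, 8-dim features; tokens 1 and 2 name the subject.
tokens = np.ones((4, 8))
base_scale = np.zeros(8)
base_shift = np.zeros(8)
ref_scale = 0.5 * np.ones(8)
ref_shift = 0.1 * np.ones(8)
mask = np.array([0.0, 1.0, 1.0, 0.0])

out = apply_token_offsets(tokens, base_scale, base_shift,
                          ref_scale, ref_shift, mask)
```

Because the offsets are gated by the token mask, each subject's reference conditioning stays localized to its own tokens, which is the property that lets multiple subjects be controlled independently without attribute entanglement.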