

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

June 26, 2025
作者: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
cs.AI

Abstract

Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
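The core mechanism the abstract describes — mapping a reference image to per-token offsets on the text-stream modulation of a DiT block — can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes an AdaLN-style scale/shift modulation, and all names (`ref_to_offsets`, `modulate_text_stream`, the dimensions, the subject mask) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 4

def layer_norm(x, eps=1e-5):
    # Standard per-token layer normalization over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ref_to_offsets(ref_emb, W):
    # Hypothetical projection of a reference-image embedding into
    # (delta_scale, delta_shift) modulation offsets.
    out = ref_emb @ W                      # shape: (2 * d_model,)
    return out[:d_model], out[d_model:]

def modulate_text_stream(tokens, scale, shift, subject_mask,
                         delta_scale, delta_shift):
    # Base AdaLN-style modulation applied to every token; the reference-
    # derived offsets are added only where subject_mask == 1, leaving the
    # rest of the prompt's tokens (and thus editability) untouched.
    h = layer_norm(tokens)
    s = scale + subject_mask[:, None] * delta_scale
    b = shift + subject_mask[:, None] * delta_shift
    return h * (1.0 + s) + b

tokens = rng.standard_normal((n_tokens, d_model))   # text-stream token states
scale = rng.standard_normal(d_model) * 0.1          # base modulation params
shift = rng.standard_normal(d_model) * 0.1
ref_emb = rng.standard_normal(16)                   # reference-image embedding
W = rng.standard_normal((16, 2 * d_model)) * 0.02   # illustrative projection
d_s, d_b = ref_to_offsets(ref_emb, W)

# Only the tokens describing the referenced subject receive the offsets.
mask = np.array([0.0, 1.0, 1.0, 0.0])
out = modulate_text_stream(tokens, scale, shift, mask, d_s, d_b)
base = modulate_text_stream(tokens, scale, shift, np.zeros(n_tokens), d_s, d_b)
```

Because the offsets are gated by a token mask, non-subject tokens produce identical activations with or without the reference image, which is one way to read the paper's claim of independent per-subject control without disrupting the rest of the generation.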