
Subject-Consistent and Pose-Diverse Text-to-Image Generation

July 11, 2025
Authors: Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai
cs.AI

Abstract

Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner; this promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
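The abstract describes the Identity Transport stage only at a high level. As a rough illustration of the underlying idea, the sketch below uses entropy-regularized optimal transport (a Sinkhorn iteration) to move reference identity features onto the tokens of a target image via a barycentric projection. This is not the authors' implementation: the function names, the use of uniform marginals, the squared-Euclidean cost, and the NumPy-based solver are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    # Entropy-regularized OT plan between two uniform distributions
    # (illustrative solver; a library such as POT would normally be used).
    n, m = cost.shape
    K = np.exp(-cost / reg)              # Gibbs kernel
    a = np.ones(n) / n                   # uniform source weights
    b = np.ones(m) / m                   # uniform target weights
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)

def transport_identity(ref_feats, tgt_feats, reg=0.1):
    """Move reference identity features onto target tokens via OT.

    ref_feats: (n, d) identity features from the reference image
    tgt_feats: (m, d) features of a target image at an early denoising step
    Returns (m, d): a pose-aware blend of reference features per target token.
    """
    # Cost: squared Euclidean distance between feature tokens.
    cost = ((ref_feats[:, None, :] - tgt_feats[None, :, :]) ** 2).sum(-1)
    plan = sinkhorn(cost, reg)
    # Barycentric projection: each target token receives a weighted
    # average of reference features, weights taken from the OT plan.
    weights = plan / plan.sum(axis=0, keepdims=True)   # normalize columns
    return weights.T @ ref_feats

rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 4))   # 8 reference tokens, 4-dim features
tgt = rng.normal(size=(6, 4))   # 6 target tokens
out = transport_identity(ref, tgt)
print(out.shape)  # (6, 4)
```

Because the transport plan matches reference tokens to the target tokens they most resemble, identity information follows the target's own layout rather than overwriting it, which mirrors the "pose-aware" transfer the abstract claims for IT.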
PDF | July 16, 2025