Subject-Consistent and Pose-Diverse Text-to-Image Generation

Abstract

Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better perceived visual quality and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
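
As a rough illustration of the optimal-transport idea behind IT, the sketch below computes an entropic transport plan between a handful of reference "identity" features and target features, then moves the identity features onto the target positions. It is not the CoDi implementation: the Sinkhorn solver, the squared-Euclidean cost, the uniform marginals, the toy 2-D features, and all function names (`cost_matrix`, `sinkhorn`, `transport_features`) are assumptions made only for this example.

```python
import math

def cost_matrix(src, tgt):
    """Squared Euclidean cost between every source and target feature (assumed cost)."""
    return [[sum((a - b) ** 2 for a, b in zip(s, t)) for t in tgt] for s in src]

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropic optimal-transport plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m          # uniform source / target mass
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_features(src, plan):
    """Barycentric projection: each target position gets a plan-weighted mix of source features."""
    n, m, d = len(src), len(plan[0]), len(src[0])
    moved = []
    for j in range(m):
        col_mass = sum(plan[i][j] for i in range(n))
        moved.append([sum(plan[i][j] * src[i][k] for i in range(n)) / col_mass
                      for k in range(d)])
    return moved

# Toy data (hypothetical): identity features from a reference image, and
# target features describing the same subject parts in a different arrangement.
identity_feats = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
target_feats = [[5.1, 4.9], [0.1, 0.0], [0.9, 1.1]]

plan = sinkhorn(cost_matrix(identity_feats, target_feats))
moved_feats = transport_features(identity_feats, plan)
```

In this toy setup the plan matches each identity feature to the target position it most resembles, regardless of where that position sits, which is the intuition behind transferring identity without forcing the target into the reference layout or pose.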
