Subject-Consistent and Pose-Diverse Text-to-Image Generation

Abstract

Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse poses and layouts. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is available at https://github.com/NJU-PCALab/CoDi.
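To make the optimal-transport idea behind Identity Transport more concrete, the following is a minimal NumPy sketch of entropic optimal transport (Sinkhorn iterations) that moves feature vectors from a reference subject onto the tokens of a target image. It is an illustration only and is not taken from the CoDi code: the function names `sinkhorn_plan` and `transport_identity`, the cosine-distance cost, the uniform marginals, and the `reg` and `blend` parameters are all assumptions chosen for the example.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan between two uniform marginals.

    cost: (n, m) pairwise cost between n reference and m target tokens.
    Returns an (n, m) non-negative transport plan.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform mass on reference tokens (assumption)
    b = np.full(m, 1.0 / m)      # uniform mass on target tokens (assumption)
    kernel = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows toward marginal a
        v = b / (kernel.T @ u)    # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]


def transport_identity(ref_feats, tgt_feats, reg=0.1, blend=0.5):
    """Blend reference identity features into target tokens via an OT plan.

    ref_feats: (n, d) features of the reference subject (hypothetical input).
    tgt_feats: (m, d) features of the target image's subject region.
    blend:     0 keeps the target unchanged, 1 uses only transported features.
    """
    ref_unit = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt_unit = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    cost = 1.0 - ref_unit @ tgt_unit.T          # cosine distance, in [0, 2]
    plan = sinkhorn_plan(cost, reg=reg)
    # Barycentric projection: each target token receives a convex
    # combination of reference features, weighted by its plan column.
    weights = plan / plan.sum(axis=0, keepdims=True)
    transported = weights.T @ ref_feats          # (m, d)
    mixed = (1.0 - blend) * tgt_feats + blend * transported
    return mixed, plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(6, 8))   # 6 reference tokens, 8-dim features
    tgt = rng.normal(size=(5, 8))   # 5 target tokens
    mixed, plan = transport_identity(ref, tgt)
    print(mixed.shape, plan.shape)
```

Because the plan matches reference tokens to target tokens by feature similarity rather than by spatial position, the target keeps its own layout while receiving identity information, which is one plausible reading of "pose-aware" transfer. How CoDi actually defines the cost, the marginals, and the features being transported is not specified in the abstract.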