Compositional Text-to-Image Generation via Region-Aware Bi-Modal Direct Preference Optimization

Abstract

Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts, including attribute binding, object relationships, and counting, remains challenging. To address this, we propose BiDPO, a framework that enhances the compositional text-to-image generation capability of T2I models. We first introduce a carefully designed pipeline for constructing BiComp, a large-scale preference dataset, under strict quality control. We then extend Diffusion DPO to jointly optimize image and text preferences, which proves highly effective in improving the model's ability to follow complex text prompts during generation. To further improve fine-grained alignment, we employ a region-level guidance method that focuses on the regions relevant to compositional concepts. Experimental results demonstrate that BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.
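
To make the joint image and text preference objective concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard Diffusion-DPO form, in which a preferred and a rejected sample are compared through their denoising errors relative to a frozen reference model, and it assumes that the image-preference and text-preference terms are simply summed. The names `masked_sq_error`, `dpo_pair_loss`, `bimodal_loss`, and `text_weight` are hypothetical, and the optional region mask is only one possible way to restrict the error to regions tied to a compositional concept.

```python
import math


def masked_sq_error(pred, target, mask=None):
    """Weighted mean squared error over flat lists of values.

    `mask` is a hypothetical per-element region weight (1.0 = inside the
    region of interest, 0.0 = outside). With no mask, this is the plain MSE.
    """
    if mask is None:
        mask = [1.0] * len(pred)
    total_weight = sum(mask)
    if total_weight == 0:
        return 0.0
    weighted = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return weighted / total_weight


def dpo_pair_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """Diffusion-DPO-style loss for one (preferred, rejected) pair.

    err_w / err_l: denoising error of the trained model on the preferred /
    rejected sample; *_ref: the same errors under the frozen reference model.
    `beta` is assumed to absorb all scaling constants.
    """
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    x = beta * margin
    # -log(sigmoid(-x)) == log(1 + exp(x)), computed in a numerically stable way.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bimodal_loss(image_pair, text_pair, beta=1.0, text_weight=1.0):
    """Assumed combination: image-preference term plus weighted text-preference term.

    Each pair is a tuple (err_w, err_w_ref, err_l, err_l_ref).
    """
    image_term = dpo_pair_loss(*image_pair, beta=beta)
    text_term = dpo_pair_loss(*text_pair, beta=beta)
    return image_term + text_weight * text_term
```

In this sketch, the loss for a pair falls below log 2 when the trained model reduces its error on the preferred sample more than on the rejected one, relative to the reference model. How BiDPO actually weights the two modalities and applies region-level guidance is not specified in the abstract.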