CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Abstract

Diffusion models have demonstrated great success in the field of text-to-image generation. However, alleviating the misalignment between text prompts and images remains challenging. The root cause of the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. A novel attribute concentration module is also proposed to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL to obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.
