A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

Abstract

Text-to-image diffusion models have achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets of (image, caption) pairs in which the images are mostly collected from the web and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, overall image quality improves: for example, FID is 14.84 vs. the baseline of 17.87 (lower is better), and human evaluation shows a 64.3% improvement in faithful image generation. Second, semantic alignment improves: for example, semantic object accuracy is 84.34 vs. 78.90, counting alignment error is 1.32 vs. 1.44 (lower is better), and positional alignment is 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
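The relabeling step described in the abstract can be sketched as follows. This is only an illustration and not the authors' implementation: `Example`, `recaption`, and `toy_captioner` are hypothetical names, image identifiers stand in for actual image data, and the captioner is a trivial placeholder for the specialized captioning model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    image_id: str  # hypothetical stand-in for the image itself
    caption: str   # text paired with the image


def recaption(
    dataset: Iterable[Example],
    captioner: Callable[[str], str],
) -> List[Example]:
    """Return new examples whose captions come from `captioner`.

    The input examples are left unchanged, so the original alt-text
    captions remain available for comparison.
    """
    return [Example(ex.image_id, captioner(ex.image_id)) for ex in dataset]


def toy_captioner(image_id: str) -> str:
    # Placeholder (assumption): a real pipeline would run a learned
    # image-captioning model on the image here.
    return f"a detailed description of {image_id}"


# Alt-text style captions, which are often uninformative.
alt_text_data = [
    Example("img_001", "IMG_1234.jpg"),
    Example("img_002", "buy now"),
]

recaptioned = recaption(alt_text_data, toy_captioner)
```

In this sketch, the text-to-image model would then be trained on `recaptioned` rather than on `alt_text_data`.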
