A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

Abstract

Text-to-image diffusion models have made a remarkable leap in capability over the last few years, enabling high-quality and diverse image synthesis from a textual prompt. However, even the most advanced models often struggle to follow all of the directions in their prompts precisely. The vast majority of these models are trained on datasets of (image, caption) pairs in which the images typically come from the web and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly impairs the model's ability to understand nuanced semantics in textual prompts. We show that relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset benefits the model substantially across the board. First, in overall image quality: e.g., FID of 14.84 vs. the baseline of 17.87, and a 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment: e.g., semantic object accuracy of 84.34 vs. 78.90, counting alignment errors of 1.32 vs. 1.44, and positional alignment of 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
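The relabeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` record, the `recaption` function, and the `captioner` callable are hypothetical names introduced here, and the fallback to the original caption when the captioner returns an empty string is an assumption of this sketch rather than something the abstract specifies. Any automatic captioning model could be wrapped as the `captioner` callable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Example:
    """One (image, caption) training pair. Field names are hypothetical."""
    image_path: str  # where the image is stored
    caption: str     # text paired with the image, e.g. HTML alt text


def recaption(
    examples: Iterable[Example],
    captioner: Callable[[str], str],
) -> Iterator[Example]:
    """Yield each example with its caption replaced by the captioner's output.

    `captioner` maps an image path to a caption string. It stands in for an
    automatic captioning model, which this sketch does not implement.
    """
    for ex in examples:
        new_caption = captioner(ex.image_path).strip()
        # Assumption of this sketch: keep the original caption when the
        # captioner produces nothing usable.
        yield Example(ex.image_path, new_caption or ex.caption)


if __name__ == "__main__":
    # Stub captioner for demonstration only; a real one would run a model.
    def stub_captioner(path: str) -> str:
        return "a brown dog lying on a red sofa" if path == "dog.jpg" else ""

    corpus = [Example("dog.jpg", "IMG_0042"), Example("cat.jpg", "cat photo")]
    for ex in recaption(corpus, stub_captioner):
        print(ex.image_path, "->", ex.caption)
```

The recaptioned pairs would then replace the original pairs as training data for the text-to-image model. Writing `recaption` as a generator keeps the sketch usable on corpora too large to hold in memory.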
