

A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

October 25, 2023
作者: Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan
cs.AI

Abstract

Text-to-image diffusion models have achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's ability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g., FID of 14.84 vs. the baseline's 17.87, and a 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment: e.g., semantic object accuracy of 84.34 vs. 78.90, counting alignment errors of 1.32 vs. 1.44 (lower is better), and positional alignment of 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
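
As a concrete illustration of the recaptioning step described above, the sketch below replaces the noisy alt-text caption of each web-scraped image with the output of an automatic captioning model. The paper's specialized captioner is not public, so the off-the-shelf BLIP model from Hugging Face transformers stands in here as an assumed substitute; the `recaption` helper and the example file path are hypothetical.

```python
# Minimal sketch of the RECAP idea: swap each noisy web alt-text caption
# for a caption generated by an automatic captioning model, producing a
# cleaner (image, caption) corpus for text-to-image training.
# NOTE: BLIP is a stand-in; the paper uses its own specialized captioner.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

def recaption(image_path: str) -> str:
    """Generate a replacement caption for one image from the corpus."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical usage on a single (image, alt-text) pair:
new_caption = recaption("laion_sample.jpg")  # replaces the HTML alt text
```

In the paper's setting, the recaptioned pairs would then be used in place of the original alt-text pairs when training the diffusion model, which is what the reported FID and alignment gains are measured against.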