A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

Abstract

Text-to-image diffusion models have achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs, where the images often come from the web and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's ability to understand nuanced semantics in textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g., FID 14.84 vs. the baseline of 17.87, and a 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment: e.g., semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44, and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
