Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency
October 5, 2023
Authors: Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan
cs.AI
Abstract
Current vision-language generative models rely on expansive corpora of paired
image-text data to attain optimal performance and generalization capabilities.
However, automatically collecting such data (e.g. via large-scale web scraping)
leads to low-quality data with poor image-text correlation, while human annotation is
more accurate but requires significant manual effort and expense. We introduce
ITIT (InTegrating Image Text): an innovative training paradigm grounded in the concept of
cycle consistency which allows vision-language training on unpaired image and
text data. ITIT comprises a joint image-text encoder with disjoint image
and text decoders that enable bidirectional image-to-text and text-to-image
generation in a single framework. During training, ITIT leverages a small set
of paired image-text data to ensure its output matches the input reasonably
well in both directions. Simultaneously, the model is also trained on much
larger datasets containing only images or texts. This is achieved by enforcing
cycle consistency between the original unpaired samples and the cycle-generated
counterparts. For instance, it generates a caption for a given input image and
then uses the caption to create an output image, and enforces similarity
between the input and output images. Our experiments show that ITIT trained on
unpaired datasets exhibits scaling behavior similar to that of training on
high-quality paired data. We demonstrate image generation and captioning
performance on par with state-of-the-art text-to-image and image-to-text models
while using orders of magnitude less paired image-text data (only 3M pairs).
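
The image-to-caption-to-image cycle described above can be summarized in a short sketch. The module names, the MSE cycle loss, and the stand-in linear layers below are illustrative assumptions, not the authors' implementation; in particular, the discrete captioning step is non-differentiable, and the actual method's gradient handling through that step is elided here.

```python
# A minimal PyTorch sketch of the image -> caption -> image cycle-consistency
# loss described in the abstract. CycleITIT and its submodules are hypothetical
# stand-ins, not the authors' code.
import torch
import torch.nn.functional as F
from torch import nn


class CycleITIT(nn.Module):
    def __init__(self, dim: int = 256, vocab: int = 1000):
        super().__init__()
        # Joint image-text encoder with disjoint decoders, as in the abstract.
        self.encoder = nn.Linear(dim, dim)         # stand-in joint encoder
        self.image_decoder = nn.Linear(dim, dim)   # stand-in image decoder
        self.text_decoder = nn.Linear(dim, vocab)  # stand-in text decoder
        self.text_embed = nn.Embedding(vocab, dim)

    def image_to_text(self, image_feat: torch.Tensor) -> torch.Tensor:
        # Caption the image: encode, then pick the most likely token.
        # NOTE: argmax is non-differentiable; the real method must handle
        # gradients through this discrete step, which this sketch elides.
        logits = self.text_decoder(self.encoder(image_feat))
        return logits.argmax(dim=-1)

    def text_to_image(self, tokens: torch.Tensor) -> torch.Tensor:
        # Generate an image representation back from caption tokens.
        return self.image_decoder(self.encoder(self.text_embed(tokens)))


def image_cycle_loss(model: CycleITIT, image_feat: torch.Tensor) -> torch.Tensor:
    # Image -> caption -> reconstructed image; penalize dissimilarity
    # between the original and the cycle-generated image.
    tokens = model.image_to_text(image_feat)
    reconstructed = model.text_to_image(tokens)
    return F.mse_loss(reconstructed, image_feat)


# Unpaired images alone suffice for this loss (features are random stand-ins).
model = CycleITIT()
images = torch.randn(4, 256)
loss = image_cycle_loss(model, images)
loss.backward()
```

The symmetric text-to-image-to-text cycle would follow the same pattern, enforcing similarity between an input caption and the caption regenerated from its synthesized image.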