

Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

October 5, 2023
作者: Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan
cs.AI

Abstract

Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g., via large-scale web scraping) yields low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text): an innovative training paradigm grounded in the concept of cycle consistency that allows vision-language training on unpaired image and text data. ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or only text. This is achieved by enforcing cycle consistency between the original unpaired samples and their cycle-generated counterparts. For instance, ITIT generates a caption for a given input image, then uses that caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT trained with unpaired datasets exhibits scaling behavior similar to training on high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models while using orders of magnitude fewer (only 3M) paired image-text examples.
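The image-to-text-to-image cycle described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `image_to_text`, `text_to_image`, and the L2 cycle loss are hypothetical stand-ins for ITIT's actual encoder/decoder networks and similarity objective, which the abstract does not specify.

```python
# Toy sketch of an ITIT-style cycle-consistency step on an unpaired image:
# caption the image, regenerate an image from the caption, and penalize
# dissimilarity between the original and the cycle-generated image.
# All components here are hypothetical stand-ins for learned networks.

def image_to_text(image):
    # Hypothetical captioner: encodes the image as a toy text description.
    return f"caption:{sum(image)}"

def text_to_image(caption):
    # Hypothetical generator: toy inverse of the captioner above,
    # spreading the described total evenly over 4 "pixels".
    total = float(caption.split(":")[1])
    return [total / 4.0] * 4

def cycle_consistency_loss(image):
    """L2 distance between an unpaired image and its cycle-generated twin."""
    reconstructed = text_to_image(image_to_text(image))
    return sum((a - b) ** 2 for a, b in zip(image, reconstructed))

# An image the toy cycle reconstructs exactly incurs zero loss;
# in training, this loss would be minimized alongside the paired losses.
unpaired_image = [1.0, 1.0, 1.0, 1.0]
loss = cycle_consistency_loss(unpaired_image)
```

The symmetric text-to-image-to-text cycle works the same way, with a text-similarity loss between the original and cycle-generated captions.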