Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Abstract

Current vision-language generative models rely on large corpora of paired image-text data to reach their best performance and generalization. However, collecting such data automatically (e.g. via large-scale web scraping) yields low-quality samples with weak image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text), a training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data. ITIT consists of a joint image-text encoder with disjoint image and text decoders, which together enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT uses a small set of paired image-text data to ensure that its output matches the input reasonably well in both directions. At the same time, the model is trained on much larger datasets containing only images or only texts. This is achieved by enforcing cycle consistency between the original unpaired samples and their cycle-generated counterparts. For instance, the model generates a caption for a given input image, uses that caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT trained with unpaired datasets exhibits scaling behavior similar to that of training with high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models while using orders of magnitude less paired image-text data (only 3M pairs).
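
The way the paired and unpaired objectives combine can be illustrated with a small sketch. This is not the ITIT implementation: the toy "models" `toy_i2t` and `toy_t2i`, the distance functions, and the `cycle_weight` parameter are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. It only shows the structure described above: a supervised term on a small paired set, plus an image cycle (image to caption to image) and a text cycle (caption to image to caption) on unpaired data.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # toy stand-in: an "image" is a flat list of pixel values
Text = List[int]     # toy stand-in: a "caption" is a list of token ids


def image_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length pixel vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("images must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def text_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where the token ids differ (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("captions must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def combined_loss(
    i2t: Callable[[Image], Text],
    t2i: Callable[[Text], Image],
    paired: Sequence[Tuple[Image, Text]],
    images_only: Sequence[Image],
    texts_only: Sequence[Text],
    cycle_weight: float = 1.0,  # hypothetical weighting, not from the source
) -> float:
    # Supervised terms on the small paired set: each direction's output
    # should match its ground-truth counterpart.
    supervised = sum(
        text_distance(i2t(img), txt) + image_distance(t2i(txt), img)
        for img, txt in paired
    )
    # Image cycle: image -> generated caption -> regenerated image.
    image_cycle = sum(image_distance(t2i(i2t(img)), img) for img in images_only)
    # Text cycle: caption -> generated image -> regenerated caption.
    text_cycle = sum(text_distance(i2t(t2i(txt)), txt) for txt in texts_only)
    return supervised + cycle_weight * (image_cycle + text_cycle)


# Toy "models" (hypothetical): captioning rounds pixels to token ids,
# and generation turns token ids back into pixel values.
def toy_i2t(image: Image) -> Text:
    return [int(round(p)) for p in image]


def toy_t2i(text: Text) -> Image:
    return [float(t) for t in text]


if __name__ == "__main__":
    loss = combined_loss(
        toy_i2t,
        toy_t2i,
        paired=[([1.0, 2.0], [1, 2])],
        images_only=[[0.2, 1.7, 3.1]],
        texts_only=[[4, 5, 6]],
    )
    print(f"combined loss: {loss:.4f}")
```

In this toy setup the paired and text-cycle terms are zero because the two mappings invert each other on integer-valued inputs, so the only contribution comes from the image cycle, where rounding loses information. In a real system both mappings would be learned networks and the loss would be minimized by gradient descent.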

