Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

Abstract

Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets whose collection requires non-trivial effort. This time-consuming process hinders the emergence of new large-scale datasets and limits researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites, whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. Compared with existing general-domain datasets, the LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while certain self-supervised visual feature extractors generalize better. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bimodal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.
