Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

Abstract

Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data collection. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. We therefore seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media posts, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites, whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. Compared with existing general-domain datasets, LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors generalize better. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bimodal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.
