Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
January 9, 2024
Authors: Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson, Aerin Kim, Somayeh Sojoudi, Kyunghyun Cho
cs.AI
Abstract
Vision and vision-language applications of neural networks, such as image
classification and captioning, rely on large-scale annotated datasets that
require non-trivial data-collecting processes. This time-consuming endeavor
hinders the emergence of large-scale datasets, limiting researchers and
practitioners to a small number of choices. Therefore, we seek more efficient
ways to collect and annotate images. Previous initiatives have gathered
captions from HTML alt-texts and crawled social media postings, but these data
sources suffer from noise, sparsity, or subjectivity. For this reason, we turn
to commercial shopping websites whose data meet three criteria: cleanliness,
informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset,
a large-scale public dataset with 15 million image-caption pairs from publicly
available e-commerce websites. When compared with existing general-domain
datasets, the LGS images focus on the foreground object and have less complex
backgrounds. Our experiments on LGS show that the classifiers trained on
existing benchmark datasets do not readily generalize to e-commerce data, while
specific self-supervised visual feature extractors can better generalize.
Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature
make it advantageous for vision-language bimodal tasks: LGS enables
image-captioning models to generate richer captions and helps text-to-image
generation models achieve e-commerce style transfer.