
Improving Multimodal Datasets with Image Captioning

July 19, 2023
Authors: Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
cs.AI

Abstract

Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity.
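To make the idea of mixing raw and generated captions concrete, below is a minimal sketch of one plausible mixing strategy: keep a web-scraped caption when it already aligns well with its image (measured by a precomputed CLIP similarity score) and fall back to a synthetic caption otherwise. The paper explores several mixing strategies, and the data structure, field names, and threshold here are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of one raw/synthetic caption mixing strategy, assuming
# CLIP similarity scores for (image, raw caption) pairs are precomputed and
# synthetic captions have already been generated by an image captioning model.
# All names and the threshold value are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class Datapoint:
    image_id: str
    raw_caption: str        # original web-scraped alt-text
    synthetic_caption: str  # caption from an image captioning model
    raw_clip_score: float   # cosine similarity of image and raw caption


def mix_captions(pool: list[Datapoint],
                 threshold: float = 0.3) -> list[tuple[str, str]]:
    """Keep the raw caption when it describes the image well;
    otherwise substitute the generated caption."""
    mixed = []
    for dp in pool:
        caption = (dp.raw_caption
                   if dp.raw_clip_score >= threshold
                   else dp.synthetic_caption)
        mixed.append((dp.image_id, caption))
    return mixed
```

The point of such a strategy is that it recovers datapoints a score-based filter would simply discard: images with nondescript alt-text stay in the pool, but with more informative text supervision attached.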