Improving Multimodal Datasets with Image Captioning

July 19, 2023
Authors: Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
cs.AI

Abstract

Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity.
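The abstract mentions mixing strategies for raw and generated captions. The sketch below is a minimal illustration of one such strategy, not the authors' exact pipeline: keep a raw web caption when its CLIP image-text similarity is high, and otherwise substitute a synthetic caption from an image captioning model. The specific model checkpoints (CLIP ViT-B/32, BLIP-2) and the similarity threshold are illustrative assumptions.

```python
# Minimal sketch of a raw/synthetic caption-mixing strategy.
# Models and the 0.28 threshold are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    Blip2ForConditionalGeneration, Blip2Processor,
)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")

@torch.no_grad()
def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

@torch.no_grad()
def generate_caption(image: Image.Image) -> str:
    """Synthetic caption from BLIP-2."""
    inputs = blip2_proc(images=image, return_tensors="pt")
    out = blip2.generate(**inputs, max_new_tokens=30)
    return blip2_proc.batch_decode(out, skip_special_tokens=True)[0].strip()

def mix_caption(image: Image.Image, raw_caption: str,
                threshold: float = 0.28) -> str:
    """Keep the raw caption if it aligns well with the image;
    otherwise fall back to the generated caption."""
    if clip_similarity(image, raw_caption) >= threshold:
        return raw_caption
    return generate_caption(image)
```

In this framing, nondescript web captions (alt-text like file names or boilerplate) score low against their images and get replaced, while informative raw captions are retained to preserve the diversity of the original text distribution.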