Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
January 9, 2024
作者: Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson, Aerin Kim, Somayeh Sojoudi, Kyunghyun Cho
cs.AI
Abstract
Vision and vision-language applications of neural networks, such as image
classification and captioning, rely on large-scale annotated datasets that
require non-trivial data-collecting processes. This time-consuming endeavor
hinders the emergence of large-scale datasets, limiting researchers and
practitioners to a small number of choices. Therefore, we seek more efficient
ways to collect and annotate images. Previous initiatives have gathered
captions from HTML alt-texts and crawled social media postings, but these data
sources suffer from noise, sparsity, or subjectivity. For this reason, we turn
to commercial shopping websites whose data meet three criteria: cleanliness,
informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset,
a large-scale public dataset with 15 million image-caption pairs from publicly
available e-commerce websites. When compared with existing general-domain
datasets, the LGS images focus on the foreground object and have less complex
backgrounds. Our experiments on LGS show that classifiers trained on
existing benchmark datasets do not readily generalize to e-commerce data,
whereas certain self-supervised visual feature extractors generalize better.
Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature
make it advantageous for bimodal vision-language tasks: LGS enables
image-captioning models to generate richer captions and helps text-to-image
generation models achieve e-commerce style transfer.
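To make the dataset's structure concrete, the sketch below models an image-caption pair and a simple caption-quality filter of the kind one might apply when curating such data. All names and the minimum-word heuristic are hypothetical illustrations, not the actual LGS schema or curation pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageCaptionPair:
    """One datapoint in an image-caption dataset (hypothetical schema)."""
    image_path: str  # path or URL of the product image
    caption: str     # e-commerce product description


def filter_informative(pairs: List[ImageCaptionPair],
                       min_words: int = 5) -> List[ImageCaptionPair]:
    """Keep only pairs whose caption has at least `min_words` words,
    a toy stand-in for an informativeness criterion."""
    return [p for p in pairs if len(p.caption.split()) >= min_words]


pairs = [
    ImageCaptionPair("img/001.jpg",
                     "Blue cotton crew-neck t-shirt with ribbed collar"),
    ImageCaptionPair("img/002.jpg", "Red mug"),
]
kept = filter_informative(pairs)
print(len(kept))  # 1 — the two-word caption is dropped
```

In practice, cleanliness, informativeness, and fluency would be judged with far richer signals than word count; this merely illustrates the bimodal datapoint layout.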