OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

June 21, 2023
Authors: Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
cs.AI

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
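Since the dataset is released publicly, a minimal sketch of how one might stream a handful of OBELICS documents is shown below. It assumes the dataset is hosted on the Hugging Face Hub under the identifier "HuggingFaceM4/OBELICS" and exposes parallel "texts" and "images" fields per document; both the identifier and the field names are assumptions not confirmed by the abstract.

```python
# Minimal sketch: stream a few interleaved image-text documents from OBELICS.
# Streaming avoids downloading the full web-scale corpus up front.
from datasets import load_dataset

# Assumed Hub identifier; adjust if the dataset is hosted elsewhere.
obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

for i, doc in enumerate(obelics):
    # Each document is expected to interleave text segments and image slots;
    # the field names "texts" and "images" are assumptions about the schema.
    texts = doc.get("texts") or []
    images = doc.get("images") or []
    print(f"doc {i}: {len(texts)} text segments, {len(images)} image slots")
    if i >= 4:  # inspect only the first five documents
        break
```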