OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
