OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.

