OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models with 9 and 80 billion parameters, named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models, and code.
