OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Abstract

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and helps maintain the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale, high-quality documents containing 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including English and non-English websites as well as video-centric websites; and 3) is more flexible, as it can easily be degraded from the image-text interleaved format into a pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this work provides a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
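
To make the flexibility claim concrete, the following is a minimal sketch of how an interleaved document could be degraded into a pure text corpus and into image-text pairs. The document schema (an ordered list of `text` and `image` items), the function names, and the pairing heuristic (each image takes the nearest following text segment, else the preceding one) are all assumptions made for illustration; they are not taken from the OmniCorpus release.

```python
from typing import Dict, List, Tuple

# Hypothetical schema (an assumption, not the OmniCorpus format):
# a document is an ordered list of items, each either
# {"type": "text", "value": <string>} or {"type": "image", "value": <url>}.
Item = Dict[str, str]


def to_pure_text(doc: List[Item]) -> str:
    """Drop all images and join the remaining text segments in order."""
    return "\n".join(item["value"] for item in doc if item["type"] == "text")


def to_image_text_pairs(doc: List[Item]) -> List[Tuple[str, str]]:
    """Pair each image with a neighbouring text segment.

    Illustrative heuristic: prefer the nearest text after the image,
    and fall back to the nearest text before it.
    """
    pairs: List[Tuple[str, str]] = []
    for i, item in enumerate(doc):
        if item["type"] != "image":
            continue
        following = next(
            (d["value"] for d in doc[i + 1:] if d["type"] == "text"), None
        )
        preceding = next(
            (d["value"] for d in reversed(doc[:i]) if d["type"] == "text"), None
        )
        caption = following if following is not None else preceding
        if caption is not None:
            pairs.append((item["value"], caption))
    return pairs


# A toy interleaved document with placeholder URLs.
doc = [
    {"type": "text", "value": "A walk through the old town."},
    {"type": "image", "value": "https://example.com/square.jpg"},
    {"type": "text", "value": "The main square at dawn."},
    {"type": "image", "value": "https://example.com/bridge.jpg"},
]

text_only = to_pure_text(doc)
pairs = to_image_text_pairs(doc)
```

Here `text_only` keeps the two text segments in their original order, and `pairs` holds one `(image_url, text)` tuple per image. A real pipeline would likely choose the paired text by image-text similarity instead of position.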
