OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
June 12, 2024
作者: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Zhongying Tu, Tong Lu, Yali Wang, Limin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He, Jifeng Dai
cs.AI
Abstract
Image-text interleaved data, consisting of multiple images and texts arranged
in a natural document format, aligns with the presentation paradigm of internet
data and closely resembles human reading habits. Recent studies have shown that
such data aids multimodal in-context learning and maintains the capabilities of
large language models during multimodal fine-tuning. However, the limited scale
and diversity of current image-text interleaved data restrict the development
of multimodal large language models. In this paper, we introduce OmniCorpus, a
10 billion-scale image-text interleaved dataset. Using an efficient data
engine, we filter and extract large-scale high-quality documents, which contain
8.6 billion images and 1,696 billion text tokens. Compared to counterparts
(e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while
maintaining good data quality; 2) features more diverse sources, including both
English and non-English websites as well as video-centric websites; 3) is more
flexible, as it can easily be degraded from the image-text interleaved format
into a pure-text corpus or image-text pairs. Through comprehensive analysis and
experiments, we
validate the quality, usability, and effectiveness of the proposed dataset. We
hope this could provide a solid data foundation for future multimodal model
research. Code and data are released at
https://github.com/OpenGVLab/OmniCorpus.
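
To illustrate the flexibility claim above, the sketch below shows one plausible way an interleaved document could be "degraded" into a pure-text corpus and into image-text pairs. The document schema (a list of text and image items in reading order) and the nearest-text pairing heuristic are assumptions for illustration only; they are not taken from the paper or the released code.

```python
# Minimal sketch (assumed schema, not the actual OmniCorpus format):
# an interleaved document is modeled as an ordered list of text and image items.

from typing import Dict, List, Tuple

Document = List[Dict[str, str]]  # each item: {"type": "text"/"image", ...}


def to_pure_text(doc: Document) -> str:
    """Drop images and concatenate text segments into a pure-text sample."""
    return "\n".join(item["text"] for item in doc if item["type"] == "text")


def to_image_text_pairs(doc: Document) -> List[Tuple[str, str]]:
    """Pair each image with the nearest following text segment (simple heuristic)."""
    pairs: List[Tuple[str, str]] = []
    pending_images: List[str] = []
    for item in doc:
        if item["type"] == "image":
            pending_images.append(item["url"])
        elif item["type"] == "text":
            for url in pending_images:
                pairs.append((url, item["text"]))
            pending_images.clear()
    return pairs


if __name__ == "__main__":
    # Hypothetical interleaved document for demonstration.
    doc = [
        {"type": "text", "text": "A short introductory paragraph."},
        {"type": "image", "url": "https://example.com/figure1.jpg"},
        {"type": "text", "text": "A caption-like paragraph that follows the image."},
    ]
    print(to_pure_text(doc))
    print(to_image_text_pairs(doc))
```

In practice, the pairing rule (nearest preceding vs. following text, or a similarity-based match) is a design choice; the point here is only that an interleaved format subsumes both the pure-text and the image-text-pair formats.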