

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

June 12, 2024
Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, Min Dou, Changyao Tian, Xizhou Zhu, Lewei Lu, Yushi Chen, Junjun He, Zhongying Tu, Tong Lu, Yali Wang, Limin Wang, Dahua Lin, Yu Qiao, Botian Shi, Conghui He, Jifeng Dai
cs.AI

Abstract

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale, high-quality documents containing 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degraded from an image-text interleaved format into a pure-text corpus or image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this provides a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
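The abstract's third point states that the interleaved format can be degraded into a pure-text corpus or image-text pairs. The following is a minimal sketch of what such a degradation could look like for a single document; the segment schema (`type`/`text`/`url` fields) and the pair-each-image-with-the-following-text heuristic are illustrative assumptions, not the paper's actual pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical interleaved document: segments appear in reading order.
doc: List[Dict[str, str]] = [
    {"type": "text", "text": "OmniCorpus interleaves images with text."},
    {"type": "image", "url": "https://example.com/figure1.jpg"},
    {"type": "text", "text": "Each image is surrounded by natural prose."},
]

def to_pure_text(document: List[Dict[str, str]]) -> str:
    """Drop images and concatenate the remaining text segments."""
    return " ".join(seg["text"] for seg in document if seg["type"] == "text")

def to_image_text_pairs(document: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each image with the text segment that immediately follows it
    (a simple heuristic chosen for illustration only)."""
    pairs = []
    for i, seg in enumerate(document):
        if (seg["type"] == "image"
                and i + 1 < len(document)
                and document[i + 1]["type"] == "text"):
            pairs.append((seg["url"], document[i + 1]["text"]))
    return pairs

print(to_pure_text(doc))          # pure-text corpus entry
print(to_image_text_pairs(doc))   # [(image_url, caption-like text)]
```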
