PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
June 20, 2024
Authors: Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Bei Chen, Qunshu Lin, Minghao Liu, Ge Zhang, Wenhu Chen
cs.AI
Abstract
Recent advancements in Large Multimodal Models (LMMs) have leveraged
extensive multimodal datasets to enhance capabilities in complex
knowledge-driven tasks. However, persistent challenges in perceptual and
reasoning errors limit their efficacy, particularly in interpreting intricate
visual data and deducing multimodal relationships. Addressing these issues, we
introduce a novel dataset format, PIN (Paired and INterleaved multimodal
documents), designed to significantly improve both the depth and breadth of
multimodal training. The PIN format is built on three foundational principles:
knowledge intensity, scalability, and support for diverse training modalities.
This innovative format combines Markdown files and comprehensive images to
enrich training data with a dense knowledge structure and versatile training
strategies. We present PIN-14M, an open-source dataset comprising 14 million
samples derived from a diverse range of Chinese and English sources, tailored
to include complex web and scientific content. This dataset is meticulously
constructed to ensure data quality and ethical integrity, aiming to facilitate
advanced training strategies and improve model robustness against common
multimodal training pitfalls. Our initial results, forming the basis of this
technical report, suggest significant potential for the PIN format in refining
LMM performance, with plans for future expansions and detailed evaluations of
its impact on model capabilities.
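
The abstract describes each PIN document as a knowledge-dense Markdown file paired with an overall image, alongside the images interleaved in the text, but it does not specify an on-disk schema. The following is a minimal sketch of how such a sample might be represented and loaded; the directory layout and file names (content.md, overall.png, images/) are illustrative assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PINSample:
    """One PIN-style sample under the assumed layout: a knowledge-dense
    Markdown document, its paired overall image, and the images that are
    interleaved (referenced inline) in the Markdown."""
    markdown: str
    overall_image: Path
    interleaved_images: list[Path] = field(default_factory=list)


def load_pin_sample(doc_dir: Path) -> PINSample:
    """Read one sample from an assumed per-document directory:
    doc_dir/content.md, doc_dir/overall.png, doc_dir/images/*.png."""
    markdown = (doc_dir / "content.md").read_text(encoding="utf-8")
    overall_image = doc_dir / "overall.png"
    interleaved = sorted((doc_dir / "images").glob("*.png"))
    return PINSample(markdown, overall_image, interleaved)


# Hypothetical usage:
# sample = load_pin_sample(Path("pin-14m/docs/000001"))
# print(len(sample.markdown), len(sample.interleaved_images))
```

Keeping the paired overall image separate from the interleaved images is what allows the same document to serve both paired (image-text) and interleaved training objectives, as described in the abstract.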