

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

June 20, 2024
作者: Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Bei Chen, Qunshu Lin, Minghao Liu, Ge Zhang, Wenhu Chen
cs.AI

Abstract

Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.