PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
June 20, 2024
Authors: Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Bei Chen, Qunshu Lin, Minghao Liu, Ge Zhang, Wenhu Chen
cs.AI
Abstract
Recent advancements in Large Multimodal Models (LMMs) have leveraged
extensive multimodal datasets to enhance capabilities in complex
knowledge-driven tasks. However, persistent challenges in perceptual and
reasoning errors limit their efficacy, particularly in interpreting intricate
visual data and deducing multimodal relationships. Addressing these issues, we
introduce a novel dataset format, PIN (Paired and INterleaved multimodal
documents), designed to significantly improve both the depth and breadth of
multimodal training. The PIN format is built on three foundational principles:
knowledge intensity, scalability, and support for diverse training modalities.
This innovative format combines markdown files and comprehensive images to
enrich training data with a dense knowledge structure and versatile training
strategies. We present PIN-14M, an open-source dataset comprising 14 million
samples derived from a diverse range of Chinese and English sources, tailored
to include complex web and scientific content. This dataset is constructed
meticulously to ensure data quality and ethical integrity, aiming to facilitate
advanced training strategies and improve model robustness against common
multimodal training pitfalls. Our initial results, forming the basis of this
technical report, suggest significant potential for the PIN format in refining
LMM performance, with plans for future expansions and detailed evaluations of
its impact on model capabilities.