InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Abstract

Pre-training on large-scale, high-quality datasets is crucial for enhancing the reasoning capabilities of Large Language Models (LLMs), especially in specialized domains such as mathematics. Despite this recognized importance, the field of Multimodal LLMs (MLLMs) currently lacks a comprehensive open-source pre-training dataset designed specifically for mathematical reasoning. To address this gap, we introduce InfiMM-WebMath-40B, a high-quality dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. We provide a detailed overview of our data collection and processing pipeline. To demonstrate the robustness of InfiMM-WebMath-40B, we conducted evaluations in both text-only and multimodal settings. On text-only benchmarks, our dataset significantly improves the performance of our 1.3B model despite using only 40 billion tokens, delivering results comparable to DeepSeekMath-1.3B, which uses 120 billion tokens at the same model size. Furthermore, with the introduction of our multimodal math pre-training dataset, our models set a new state of the art among open-source models on multimodal math benchmarks such as MathVerse and We-Math. We release our data at https://huggingface.co/datasets/Infi-MM/InfiMM-WebMath-40B.
