ChartM^3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

Abstract

Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of the complex chart scenarios and computation-intensive reasoning tasks that are prevalent in real-world applications. To address these limitations, this study proposes an automated, multi-stage, code-driven pipeline for systematically generating visual reasoning datasets. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning code that simulates real data distributions, thereby driving both chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM^3, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples that enable practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that the dataset significantly improves reasoning capabilities and cross-domain generalization, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
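
The code-driven idea in the abstract, where one piece of generated code drives both the chart and the statistics behind a question, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `make_chart_sample`, the dictionary layout, the quarterly categories, and the Gaussian parameters are all hypothetical choices made for illustration, and the RAG, CoT, and model-based evaluation stages are omitted entirely.

```python
import random
import statistics


def make_chart_sample(seed: int = 0) -> dict:
    """Build one hypothetical sample: chart data plus a computed Q&A pair."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    categories = ["Q1", "Q2", "Q3", "Q4"]

    # Synthetic values standing in for a simulated data distribution
    # (assumed here to be Gaussian; the source does not specify one).
    values = [round(rng.gauss(100, 15), 1) for _ in categories]

    # The answer is computed from the same data that would be plotted,
    # so it never has to be read back off a rendered image.
    top_category = max(zip(categories, values), key=lambda pair: pair[1])[0]
    mean_value = round(statistics.mean(values), 2)

    return {
        "chart": {"type": "bar", "x": categories, "y": values},
        "question": (
            "Which quarter has the highest value, "
            "and what is the mean across all quarters?"
        ),
        "answer": {"category": top_category, "mean": mean_value},
    }


sample = make_chart_sample(seed=42)
print(sample["question"])
print(sample["answer"])
```

In a sketch like this, the `chart` entry could be handed to any plotting library for rendering, while the `answer` entry serves as the ground truth for the question. Because both come from the same values, the answer stays consistent with the chart by construction.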
