MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Abstract

Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by how humans typically approach such problems, we hypothesize that the perceptual capability to extract meaningful information from diagrams is crucial, because it directly affects the subsequent inference process. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned properties from diagrams and when performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages so that each can be optimized independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
