MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Abstract

Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by the typical human problem-solving process, we hypothesize that the perceptual capability to extract meaningful information from diagrams is crucial, as it directly impacts the subsequent inference process. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned properties from diagrams and when performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages so that each can be optimized independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
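
The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract specifies no interface, so the class `TwoStagePipeline`, the callable signatures, the prompt layout, and the stub models below are all hypothetical. It also assumes that the inference stage receives only text, which is one possible reading of "decouples perception and inference".

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the abstract does not define any interface.
PerceptionModel = Callable[[bytes], str]  # diagram image bytes -> textual description
InferenceModel = Callable[[str], str]     # text prompt -> answer text


@dataclass
class TwoStagePipeline:
    """Illustrative two-stage solver: perceive first, then reason over text."""

    perceive: PerceptionModel
    infer: InferenceModel

    def solve(self, diagram: bytes, question: str) -> str:
        # Stage 1: the perception model turns the diagram into text.
        description = self.perceive(diagram)
        # Stage 2 (assumption): the inference model sees only text, so any
        # text-capable reasoner can be swapped in without retraining stage 1.
        prompt = f"Diagram description:\n{description}\n\nQuestion:\n{question}"
        return self.infer(prompt)


# Stub models standing in for real ones; their outputs are made up.
def stub_perception(diagram: bytes) -> str:
    return "Triangle ABC with AB = 3, BC = 4, angle B = 90 degrees."


def stub_inference(prompt: str) -> str:
    return f"received {len(prompt.splitlines())} prompt lines"


pipeline = TwoStagePipeline(perceive=stub_perception, infer=stub_inference)
print(pipeline.solve(b"", "Find AC."))
```

Because the two stages communicate only through a string in this sketch, replacing `stub_inference` with a different reasoner requires no change to the perception side, which mirrors the compatibility claim in the abstract.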

