MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Abstract

Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by the typical human problem-solving process, we hypothesize that the perceptual ability to extract meaningful information from diagrams is crucial, as it directly affects the subsequent inference process. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations both in extracting essential information and reasoned properties from diagrams and in performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages so that each can be optimized independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at https://github.com/MathFlow-zju/MathFlow.
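The decoupling described above can be illustrated with a minimal sketch. This is not the released MathFlow code: the `TwoStagePipeline` class, the two callable interfaces, and the stub models are all hypothetical names introduced here, on the assumption that the perception stage emits a textual description of the diagram and the inference stage reasons over text only.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not taken from the MathFlow repository).
PerceptionModel = Callable[[bytes, str], str]  # (diagram image, question) -> description
InferenceModel = Callable[[str], str]          # text prompt -> answer


@dataclass
class TwoStagePipeline:
    """Perception and inference as separate, swappable stages."""
    perceive: PerceptionModel
    infer: InferenceModel

    def solve(self, diagram: bytes, question: str) -> str:
        # Stage 1: the perception model converts the diagram into text.
        description = self.perceive(diagram, question)
        # Stage 2: the inference model sees only text, never the image.
        prompt = f"Diagram description:\n{description}\n\nQuestion:\n{question}"
        return self.infer(prompt)


# Stub models standing in for real MLLMs, for illustration only.
def stub_perception(diagram: bytes, question: str) -> str:
    return "Right triangle ABC with AB = 3, BC = 4, right angle at B."


def stub_inference(prompt: str) -> str:
    return "5" if "AB = 3" in prompt and "BC = 4" in prompt else "unknown"


pipeline = TwoStagePipeline(perceive=stub_perception, infer=stub_inference)
print(pipeline.solve(b"", "What is the length of AC?"))
```

Because the two stages communicate only through text, either callable can be replaced without touching the other, which is the property that would let one perception model be paired with different inference models.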
