MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
November 10, 2025
Authors: Jinhao Chen, Zhen Yang, Jianxin Shi, Tianyu Wo, Jie Tang
cs.AI
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language question-answering tasks. Despite these strengths, such models often struggle with complex reasoning tasks such as mathematical problem solving. Previous work has focused on fine-tuning with specialized mathematical datasets. However, these datasets are typically distilled directly from teacher models, so they capture only static reasoning patterns and leave a substantial gap relative to the student model. This reliance on fixed teacher-derived data not only restricts the model's ability to adapt to novel or more intricate questions beyond the confines of the training data, but also lacks the iterative depth needed for robust generalization. To overcome these limitations, we propose MathSE, a Mathematical Self-Evolving framework for MLLMs. In contrast to traditional one-shot fine-tuning paradigms, MathSE iteratively refines the model through cycles of inference, reflection, and reward-based feedback. Specifically, we perform iterative fine-tuning by incorporating correct reasoning paths obtained from previous-stage inference and integrating reflections from a specialized Outcome Reward Model (ORM). To verify the effectiveness of MathSE, we evaluate it on a suite of challenging benchmarks, demonstrating significant performance gains over the backbone models. Notably, our results on MathVL-test surpass those of QVQ, the leading open-source multimodal mathematical reasoning model. Our code and models are available at https://zheny2751-dotcom.github.io/MathSE.github.io/.
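The abstract describes an iterative loop: the model generates candidate solutions, the ORM verifies them and produces reflections on failures, and the verified reasoning paths feed the next round of fine-tuning. Below is a minimal Python sketch of one way such a loop could be organized; the interfaces (PolicyModel, OutcomeRewardModel), the sampling count, and the acceptance threshold are hypothetical assumptions for illustration, not the paper's released implementation.

```python
from typing import List, Protocol, Tuple


class PolicyModel(Protocol):
    """Assumed interface for the MLLM being trained (hypothetical)."""

    def generate(self, problem: str, num_samples: int = 1,
                 hint: str | None = None) -> List[str]: ...

    def fine_tune(self, data: List[Tuple[str, str]]) -> "PolicyModel": ...


class OutcomeRewardModel(Protocol):
    """Assumed interface for the ORM that scores and critiques solutions."""

    def score(self, problem: str, path: str) -> float: ...

    def reflect(self, problem: str, path: str) -> str: ...


def self_evolve(model: PolicyModel, orm: OutcomeRewardModel,
                problems: List[str], iterations: int = 3,
                accept_threshold: float = 0.5) -> PolicyModel:
    """One reading of the self-evolving loop: infer, reflect via the ORM, fine-tune."""
    for _ in range(iterations):
        accepted: List[Tuple[str, str]] = []
        for problem in problems:
            # 1. Inference: sample candidate reasoning paths from the current model.
            for path in model.generate(problem, num_samples=4):
                if orm.score(problem, path) >= accept_threshold:
                    # 2. Correct paths from this stage become training data.
                    accepted.append((problem, path))
                    continue
                # 3. Reflection: the ORM critiques the failed path, and the
                #    model retries with that feedback as a hint.
                reflection = orm.reflect(problem, path)
                revised = model.generate(problem, hint=reflection)[0]
                if orm.score(problem, revised) >= accept_threshold:
                    accepted.append((problem, revised))
        # 4. Reward-guided fine-tuning on accepted paths, then iterate.
        model = model.fine_tune(accepted)
    return model
```

In this reading, the ORM plays two roles: a filter that admits only verified reasoning paths into the training set, and a critic whose reflections give the model a second chance on failed problems before the next fine-tuning round.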