

MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning

November 10, 2025
Authors: Jinhao Chen, Zhen Yang, Jianxin Shi, Tianyu Wo, Jie Tang
cs.AI

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language question answering tasks. Despite these strengths, such models often struggle with complex reasoning tasks such as mathematical problem solving. Previous work has focused on fine-tuning with specialized mathematical datasets. However, these datasets are typically distilled directly from teacher models, so they capture only static reasoning patterns and leave substantial gaps relative to the student model. This reliance on fixed teacher-derived datasets not only restricts the model's ability to adapt to novel or more intricate questions beyond the confines of the training data, but also lacks the iterative depth needed for robust generalization. To overcome these limitations, we propose MathSE, a Mathematical Self-Evolving framework for MLLMs. In contrast to traditional one-shot fine-tuning paradigms, MathSE iteratively refines the model through cycles of inference, reflection, and reward-based feedback. Specifically, we perform iterative fine-tuning by incorporating correct reasoning paths derived from previous-stage inference and integrating reflections from a specialized Outcome Reward Model (ORM). To verify the effectiveness of MathSE, we evaluate it on a suite of challenging benchmarks, demonstrating significant performance gains over backbone models. Notably, our experimental results on MathVL-test surpass those of QVQ, the leading open-source multimodal mathematical reasoning model. Our code and models are available at https://zheny2751-dotcom.github.io/MathSE.github.io/.
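To make the training loop described above concrete, the following is a minimal Python sketch of one plausible reading of MathSE's inference-reflection-reward cycle: the current model answers problems, the ORM judges each solution, correct reasoning paths are kept, faulty ones are retried with the ORM's reflection as guidance, and the model is fine-tuned on the collected data each round. All interfaces here (model.generate, orm.judge, model.fine_tune) are illustrative assumptions, not the authors' released API.

```python
# Hypothetical sketch of a self-evolving fine-tuning loop in the spirit of
# MathSE. The model and ORM objects are assumed duck-typed interfaces;
# none of these method names come from the paper's released code.

from dataclasses import dataclass

@dataclass
class Sample:
    problem: str          # math problem (text plus image reference)
    solution: str         # model-generated reasoning path
    reflection: str = ""  # ORM feedback attached to a revised path, if any

def self_evolve(model, orm, problems, rounds: int = 3):
    for _ in range(rounds):
        train_set = []
        for problem in problems:
            # Stage inference: the current model proposes a reasoning path.
            solution = model.generate(problem)
            correct, reflection = orm.judge(problem, solution)
            if correct:
                # Keep correct reasoning paths as positive supervision.
                train_set.append(Sample(problem, solution))
            else:
                # One plausible use of reflections: retry with the ORM's
                # feedback as a hint, and keep the revision only if it
                # now passes the ORM's check.
                revised = model.generate(problem, hint=reflection)
                if orm.judge(problem, revised)[0]:
                    train_set.append(Sample(problem, revised, reflection))
        # Reward-guided fine-tuning on this round's verified data.
        model = model.fine_tune(train_set)
    return model
```

The key design point the abstract emphasizes is that the training data is regenerated from the student's own previous-stage inference each round, rather than distilled once from a fixed teacher, so the supervision tracks the student's evolving failure modes.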