

MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

October 16, 2025
Authors: Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li
cs.AI

Abstract

While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/
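The abstract describes MathCanvas-Instruct as a dataset of interleaved visual-textual reasoning paths. The sketch below is a minimal, purely hypothetical illustration of how one such training record might be represented; the class and field names (`Step`, `InterleavedExample`, `problem`, `solution`, and the file paths) are assumptions for clarity, not the authors' released schema.

```python
# Hypothetical sketch of an interleaved visual-textual reasoning record,
# assuming a simple list of alternating text and image steps.
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Step:
    kind: Literal["text", "image"]  # textual reasoning step, or a generated/edited diagram
    content: str                    # reasoning text, or a path/ID referencing the diagram image

@dataclass
class InterleavedExample:
    problem: str                                         # math problem statement
    input_images: List[str] = field(default_factory=list)
    solution: List[Step] = field(default_factory=list)   # interleaved visual-textual solution

# Illustrative (invented) example: the model draws an auxiliary line, then reasons over it.
example = InterleavedExample(
    problem="In triangle ABC, ... prove that ...",
    input_images=["figures/abc_0001.png"],
    solution=[
        Step("text", "Draw the altitude from A to BC and call the foot H."),
        Step("image", "figures/abc_0001_edit1.png"),  # model-edited diagram with the altitude added
        Step("text", "Since AH is perpendicular to BC, triangles ABH and ACH are right triangles, so ..."),
    ],
)
```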