CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
October 13, 2025
Authors: Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu
cs.AI
Abstract
Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet these models still face a critical bottleneck on problems that require visual assistance, such as drawing auxiliary lines or plotting functions. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the precision and controllability such tasks demand. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages a VLM to generate text reasoning together with executable plotting code; the code is then rendered into images that serve as "visual thoughts" for solving mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized in parsing complex mathematical figures into code. Finally, using this training data, we train the CodePlot-CoT model to solve mathematical problems. Experimental results show that our model achieves up to a 21% improvement over its base model on our new benchmark, validating the efficacy of the proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, a comprehensive benchmark, and a strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
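
The loop the abstract describes (the VLM emits executable plotting code, which is rendered into an image and fed back as a "visual thought") can be pictured with a minimal sketch. The snippet below is illustrative only: the <plot> tag format, the function names, and the toy model response are assumptions for the example, not the paper's actual interface, and a real pipeline would sandbox code execution and loop the rendered image back into the model.

```python
# Minimal sketch of the code-driven "visual thought" loop, under the
# assumptions stated above.
import io
import re
from typing import Optional

import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display required
import matplotlib.pyplot as plt


def extract_plot_code(model_output: str) -> Optional[str]:
    """Pull the first <plot>...</plot> block out of a model response (assumed format)."""
    match = re.search(r"<plot>(.*?)</plot>", model_output, re.DOTALL)
    return match.group(1) if match else None


def render_visual_thought(plot_code: str) -> bytes:
    """Execute model-generated plotting code and rasterize the figure to PNG bytes.

    NOTE: exec() on untrusted model output is unsafe; a real system would
    sandbox this step.
    """
    namespace = {"plt": plt}
    exec(plot_code, namespace)  # the generated code is expected to draw on plt
    buf = io.BytesIO()
    plt.savefig(buf, format="png", bbox_inches="tight")
    plt.close("all")
    return buf.getvalue()


# Toy stand-in for a VLM response that interleaves text reasoning and code.
response = """To locate the intersections, I plot the parabola and the line:
<plot>
import numpy as np
x = np.linspace(-3, 3, 200)
plt.plot(x, x**2, label="y = x^2")
plt.plot(x, x + 2, label="y = x + 2")
plt.legend()
</plot>
The crossings near x = -1 and x = 2 guide the algebraic solution."""

code = extract_plot_code(response)
if code is not None:
    png_bytes = render_visual_thought(code)
    # In the full paradigm, this image would be fed back to the VLM as a
    # "visual thought" conditioning the next reasoning step.
    print(f"rendered visual thought: {len(png_bytes)} bytes")
```

Rendering through code is what supplies the precision and controllability the abstract contrasts with free-form interleaved image generation: the figure is exactly what the generated equations and plot commands specify.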