CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

October 13, 2025
Authors: Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu
cs.AI

Abstract

Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck on problems that require visual assistance, such as drawing auxiliary lines or plotting functions. Most LLMs and VLMs are constrained to text-only reasoning chains, while unified multimodal models that can generate interleaved text and images lack the precision and controllability such tasks demand. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages a VLM to generate text reasoning together with executable plotting code, which is then rendered into images as "visual thoughts" to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into code. Finally, using this training data, we train the CodePlot-CoT model to solve mathematical problems. Experimental results show that our model achieves up to a 21% improvement over the base model on our new benchmark, fully validating the efficacy of the proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, a comprehensive benchmark, and a strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
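
The core loop described in the abstract, where the model emits executable plotting code that is rendered into an image and fed back into the reasoning chain as a "visual thought", can be illustrated with a short sketch. Below is a minimal, hedged example assuming matplotlib as the rendering backend; the helper render_plot_code and the toy tangency problem are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch of the code-driven "visual thought" step: execute
# model-generated plotting code and capture the resulting figure as an
# image that can be appended to the interleaved reasoning chain.
# NOTE: render_plot_code and the example below are hypothetical, for
# illustration only; they are not the paper's actual API.
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering; no display required
import matplotlib.pyplot as plt


def render_plot_code(code: str) -> bytes:
    """Execute generated matplotlib code and return the figure as PNG bytes."""
    namespace = {"plt": plt}
    exec(code, namespace)           # run the model-generated plotting code
    buf = io.BytesIO()
    plt.savefig(buf, format="png")  # rasterize the current figure
    plt.close("all")
    return buf.getvalue()


# Example "visual thought": plot y = x^2 with the auxiliary line y = 2x - 1.
# Since x^2 - (2x - 1) = (x - 1)^2, the single intersection at x = 1 shows
# the line is tangent to the parabola.
visual_thought = render_plot_code(
    "import numpy as np\n"
    "x = np.linspace(-2, 3, 200)\n"
    "plt.plot(x, x**2, label='y = x^2')\n"
    "plt.plot(x, 2*x - 1, label='y = 2x - 1')\n"
    "plt.legend()\n"
)
# `visual_thought` (PNG bytes) would be inserted into the reasoning chain
# as the next image step before the model continues in text.
```

In the paper's paradigm, such rendered images are interleaved with the text reasoning, giving the model a precise, controllable way to "draw" during problem solving, in contrast to pixel-level image generation.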