

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

June 5, 2025
Authors: Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
cs.AI

Abstract

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar textual reasoning to image inputs or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
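
The abstract describes the mechanism only at a high level. The following is a minimal, hypothetical sketch (not the authors' released implementation) of how an Interleave Token could trigger token-level selection of figure patches and splice them into a reasoning step. All names here (INTERLEAVE_ID, select_visual_tokens, the similarity threshold) are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of interleaved visual-token selection, assuming a
# patch-level figure encoding and a pooled hidden state per reasoning step.
import torch

INTERLEAVE_ID = 32001       # hypothetical id of a special <interleave> token
SELECT_THRESHOLD = 0.5      # hypothetical similarity cutoff for selection


def select_visual_tokens(step_hidden, visual_tokens, threshold=SELECT_THRESHOLD):
    """Pick visual tokens whose similarity to the current reasoning step
    exceeds a threshold; the selected set can form an arbitrarily shaped
    region of the figure, not just a rectangular crop."""
    # step_hidden: (d,) pooled hidden state of the current reasoning step
    # visual_tokens: (n, d) patch-level features of the math figure
    sims = torch.nn.functional.cosine_similarity(
        visual_tokens, step_hidden.unsqueeze(0), dim=-1
    )                                         # (n,)
    mask = sims > threshold                   # boolean selection mask
    return visual_tokens[mask], mask


def interleave_step(text_embeds, step_hidden, visual_tokens):
    """Append the selected visual-token embeddings after the current step,
    emulating generation of an interleave token followed by visual content."""
    selected, mask = select_visual_tokens(step_hidden, visual_tokens)
    return torch.cat([text_embeds, selected], dim=0), mask


if __name__ == "__main__":
    d, n_text, n_patches = 64, 10, 49
    text_embeds = torch.randn(n_text, d)      # embeddings of one reasoning step
    step_hidden = text_embeds.mean(dim=0)     # toy pooled step representation
    visual_tokens = torch.randn(n_patches, d) # toy 7x7 grid of figure patches
    fused, mask = interleave_step(text_embeds, step_hidden, visual_tokens)
    print(f"selected {int(mask.sum())} of {n_patches} visual tokens; "
          f"fused sequence length = {fused.shape[0]}")
```

Because selection operates on individual visual tokens rather than rectangular crops, the chosen region can take any shape within the figure, which is the property the abstract emphasizes over box-shaped image regions.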