CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
January 5, 2026
Authors: Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, Hangjie Yuan
cs.AI
Abstract
Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all overlook a key issue: whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a cognitive-science-inspired three-stage framework that introduces a knowledge internalization stage to explicitly simulate the hierarchical flow of human reasoning: perception → internalization → reasoning. In line with this hierarchical flow, we holistically enhance all of its stages. We devise Synergistic Visual Rewards to boost perception in the parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams. To guarantee the faithful integration of extracted visual cues into subsequent reasoning, we introduce a Knowledge Internalization Reward model in the internalization stage, bridging perception and reasoning. Moreover, we design a Visual-Gated Policy Optimization algorithm to further enforce that reasoning is grounded in the visual knowledge, preventing models from taking shortcuts through reasoning chains that appear coherent but are visually ungrounded. Finally, we contribute MathCog, a new training dataset containing over 120K high-quality perception-reasoning-aligned annotations. Comprehensive experiments and analyses on commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.
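As a rough illustration of the visual-gating idea described in the abstract, the sketch below shows one plausible way stage-wise rewards could be combined so that the reasoning reward is withheld when a chain is not grounded in the extracted visual cues. The function name, reward decomposition, and threshold here are assumptions made for illustration only; they are not the paper's actual Visual-Gated Policy Optimization algorithm.

```python
# Illustrative sketch (not the paper's implementation): one plausible way a
# "visual gate" could modulate the reasoning reward so that chains which are
# not grounded in the extracted visual cues receive no credit. All names and
# thresholds below are assumptions for illustration.

def visual_gated_reward(perception_score: float,
                        internalization_score: float,
                        reasoning_score: float,
                        gate_threshold: float = 0.5) -> float:
    """Combine stage-wise rewards; the reasoning term is gated on grounding."""
    # The gate opens only when both the perception and internalization
    # evidence are strong enough, blocking reasoning chains that read
    # coherently but are visually ungrounded.
    gate = 1.0 if min(perception_score, internalization_score) >= gate_threshold else 0.0
    return perception_score + internalization_score + gate * reasoning_score


if __name__ == "__main__":
    # A fluent but poorly internalized chain earns no reasoning credit.
    print(visual_gated_reward(0.9, 0.2, 1.0))  # 1.1
    # A grounded chain keeps its reasoning reward.
    print(visual_gated_reward(0.9, 0.8, 1.0))  # 2.7
```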