Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math
March 26, 2026
Authors: Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen
cs.AI
Abstract
Assessing students' handwritten scratchwork is crucial for personalized educational feedback, but it presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP focuses primarily on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing producing correct answers over diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students and supports two key tasks, Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through a rigorous human-machine collaborative pipeline involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, and large reasoning models show strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.
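The abstract does not spell out the benchmark's schema or its seven error types, so the following is only a minimal, hypothetical Python sketch of what an Error Cause Classification (ECC) evaluation loop over ScratchMath-style samples might look like. The sample fields, the `query_mllm` helper, and the prompt wording are illustrative assumptions, not the paper's actual interface or released framework.

```python
# Hypothetical sketch of an ECC evaluation loop; field names, error types,
# and the MLLM call are assumptions for illustration, not the paper's API.
from dataclasses import dataclass


@dataclass
class ScratchworkSample:
    image_path: str          # photo of the handwritten scratchwork (assumed field)
    problem_text: str        # math problem statement (assumed field)
    error_type: str          # gold label, one of the benchmark's seven error types
    error_explanation: str   # gold free-text explanation used for the ECE task


def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to any multimodal LLM; returns its text answer."""
    raise NotImplementedError


def evaluate_ecc(samples: list[ScratchworkSample], error_types: list[str]) -> float:
    """Accuracy of an MLLM at choosing the correct error type for each sample."""
    correct = 0
    for s in samples:
        prompt = (
            "The image shows a student's handwritten scratchwork for this problem.\n"
            f"Problem: {s.problem_text}\n"
            "Which error type best explains the student's mistake? "
            f"Answer with exactly one of: {', '.join(error_types)}."
        )
        prediction = query_mllm(s.image_path, prompt).strip()
        correct += int(prediction == s.error_type)
    return correct / len(samples) if samples else 0.0
```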