

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

March 26, 2026
作者: Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen
cs.AI

Abstract

Assessing students' handwritten scratchwork is crucial for personalized educational feedback, but it presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP focuses primarily on textual responses, neglecting the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing the generation of correct answers over the diagnosis of student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark designed specifically for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students and supports two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through a rigorous human-machine collaborative pipeline involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, and large reasoning models show strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.
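The two tasks described above lend themselves to a simple evaluation harness. The sketch below is a hypothetical illustration, not the paper's actual schema or code: the field names (`image_path`, `error_type`, `error_explanation`) and the example category labels are assumptions. It models a ScratchMath-style sample carrying both an ECC label and an ECE reference explanation, and computes ECC accuracy over model predictions.

```python
from dataclasses import dataclass

# Hypothetical schema for a ScratchMath-style sample; field names and
# category labels are illustrative, not taken from the paper.
@dataclass
class ScratchworkSample:
    image_path: str         # scan of the student's handwritten scratchwork
    problem_text: str       # the math problem being solved
    error_type: str         # one of the seven error categories (ECC label)
    error_explanation: str  # expert-written cause explanation (ECE reference)

def ecc_accuracy(samples, predictions):
    """Fraction of samples whose predicted error type matches the gold label."""
    assert len(samples) == len(predictions)
    correct = sum(s.error_type == p for s, p in zip(samples, predictions))
    return correct / len(samples)

samples = [
    ScratchworkSample("s1.png", "27 + 58 = ?", "calculation",
                      "Carried incorrectly when adding the units digits."),
    ScratchworkSample("s2.png", "x + 3 = 7", "concept",
                      "Subtracted 3 from only one side of the equation."),
]
print(ecc_accuracy(samples, ["calculation", "calculation"]))  # 0.5
```

ECC reduces to a standard classification metric like this; ECE, by contrast, requires comparing free-text explanations against expert references, which typically needs a text-similarity or judge-model score rather than exact matching.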