EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions
April 30, 2026
Authors: Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions, with their intertwined mathematical formulas, diagrams, and textual reasoning, remains a major challenge due to the lack of authentic, domain-specific benchmarks. Moreover, current evaluation paradigms rely predominantly on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content and therefore fail to capture how well MLLMs understand complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset of 1,300+ authentic student handwritten solutions from a university-level STEM course. Using expert-verified verbatim transcriptions and grading reports of the student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging the identified error patterns to preemptively detect and correct recognition errors, with only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of a deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.
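As a rough illustration of the triage idea described in the abstract, the Python sketch below flags MLLM transcriptions that match known recognition-error patterns for human grading and routes the remainder to an automated grader. All names, the example patterns, and the `auto_grade` callable are hypothetical placeholders, not the paper's released code; the actual error patterns are derived from the expert-verified transcriptions in the dataset.

```python
import re
from typing import Callable, Iterable

# Hypothetical recognition-error patterns (for illustration only); the paper
# derives its patterns from expert-verified transcriptions, not shown here.
SUSPECT_PATTERNS = [
    r"\?\?",                     # unrecognized symbols left in the transcription
    r"\b\d+\s*(?:ohm|Ω)\s*\d+",  # merged value/unit pairs, e.g. "10 Ω 5"
    r"[A-Za-z]_\{?\}",           # dangling subscripts in recognized formulas
]


def needs_human_review(transcription: str,
                       patterns: Iterable[str] = SUSPECT_PATTERNS) -> bool:
    """Flag a transcription when any known recognition-error pattern appears."""
    return any(re.search(p, transcription) for p in patterns)


def route_submissions(
    submissions: dict[str, str],
    auto_grade: Callable[[str], float],
) -> tuple[dict[str, float], list[str]]:
    """Send clean transcriptions to the automated grader; queue the rest for humans."""
    auto_scores: dict[str, float] = {}
    human_queue: list[str] = []
    for student_id, text in submissions.items():
        if needs_human_review(text):
            # In the paper's case study, roughly 3.3% of assignments were
            # routed to human graders in this way.
            human_queue.append(student_id)
        else:
            auto_scores[student_id] = auto_grade(text)
    return auto_scores, human_queue
```

Under this sketch, the automated grader (e.g., a GPT-5.1-based grader as in the case study) only ever sees transcriptions that pass the pattern check, which is what makes the small human-grading fraction sufficient to improve overall robustness.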