

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

April 30, 2026
Authors: Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of the deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.
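The abstract's proposed mitigation, pre-checking recognized transcripts against known error patterns and routing suspicious ones to human graders, can be illustrated with a minimal sketch. The pattern set, function names, and routing criteria below are hypothetical illustrations, not the paper's actual implementation, which is not specified in the abstract:

```python
import re
from dataclasses import dataclass, field

# Hypothetical failure patterns for MLLM-recognized handwriting, e.g. empty
# parentheses where a symbol was dropped, runs of '?' from uncertain OCR,
# or a LaTeX fraction with an empty numerator. Real deployments would
# derive these from an error analysis like the one the paper describes.
ERROR_PATTERNS = [
    re.compile(r"\(\s*\)"),        # empty parentheses: likely dropped symbol
    re.compile(r"\?{2,}"),         # runs of '?' emitted for unreadable glyphs
    re.compile(r"\\frac\{\s*\}"),  # \frac with an empty numerator
]

@dataclass
class RoutingDecision:
    route: str                     # "human" or "auto"
    matched: list = field(default_factory=list)

def route_solution(transcript: str) -> RoutingDecision:
    """Pre-check a recognized transcript; route flagged solutions to a
    human grader and clean ones to the automated grader."""
    hits = [p.pattern for p in ERROR_PATTERNS if p.search(transcript)]
    return RoutingDecision(route="human" if hits else "auto", matched=hits)
```

Under such a scheme, the fraction of transcripts that trip any pattern corresponds to the human-graded share (3.3% in the paper's case study), while the rest proceed to the automated grader.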