EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Abstract

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of the deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.
