EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Abstract

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs' upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models' insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. As a potential solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and correct recognition errors, while requiring only minimal human intervention (e.g., routing 3.3% of assignments to human graders and the remainder to the GPT-5.1 grader), can effectively enhance the robustness of the deployed AI-enabled grading system. Code and dataset are available in this GitHub repo: https://gt-learning-innovation.github.io/CIRCUIT_EDU_HW_ACL.
