

SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models

January 11, 2026
Authors: Yuhang Su, Mei Wang, Yaoyao Zhong, Guozhang Li, Shixing Li, Yihan Feng, Hua Huang
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models must not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.
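
Since the abstract frames the task as scoring an MLLM's error diagnoses against annotated student responses, a minimal evaluation-loop sketch follows. The file name, field names (`image`, `question`, `error_label`), and the `grade_with_mllm` helper are hypothetical illustrations, not the repository's actual interface; consult the GitHub repo above for the real evaluation scripts.

```python
# Hypothetical sketch of a diagnostic-grading evaluation loop for a
# benchmark like SketchJudge. All file/field names here are assumptions.
import json
from pathlib import Path


def grade_with_mllm(image_path: str, question: str) -> str:
    """Stub for an MLLM call that returns a predicted error label.

    Replace with a real multimodal model client. Returning a fixed
    label keeps the sketch executable end to end.
    """
    return "no_error"


def evaluate(benchmark_file: str) -> float:
    """Compute diagnostic accuracy over a list of annotated items."""
    items = json.loads(Path(benchmark_file).read_text(encoding="utf-8"))
    correct = 0
    for item in items:
        pred = grade_with_mllm(item["image"], item["question"])
        correct += int(pred == item["error_label"])
    return correct / len(items)


if __name__ == "__main__":
    # Assumed annotation file: one JSON list of per-response records.
    acc = evaluate("sketchjudge_annotations.json")
    print(f"diagnostic accuracy: {acc:.3f}")
```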