SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models
January 11, 2026
Authors: Yuhang Su, Mei Wang, Yaoyao Zhong, Guozhang Li, Shixing Li, Yihan Feng, Hua Huang
cs.AI
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where a model must not only solve a problem but also diagnose errors in a hand-drawn diagram. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored to evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge comprises 1,015 hand-drawn student responses across four domains (geometry, physics, charts, and flowcharts), featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind human graders, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.
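At its core, evaluating a grader on this benchmark amounts to comparing a model's diagnosed error labels against gold annotations for each student response, broken down by domain. The sketch below illustrates one such per-domain accuracy computation; the JSONL layout and field names (`domain`, `gold_error`, `predicted_error`) are assumptions made for illustration, not the schema actually used in the SketchJudge repository.

```python
# Minimal sketch: per-domain grading accuracy over a predictions file.
# Assumed (hypothetical) JSONL schema, one record per student response:
#   {"domain": "geometry", "gold_error": "...", "predicted_error": "..."}
import json
from collections import defaultdict


def per_domain_accuracy(path: str) -> dict[str, float]:
    """Compare predicted error labels against gold annotations,
    grouped by domain (geometry, physics, charts, flowcharts)."""
    correct: defaultdict[str, int] = defaultdict(int)
    total: defaultdict[str, int] = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            domain = record["domain"]
            total[domain] += 1
            if record["predicted_error"] == record["gold_error"]:
                correct[domain] += 1
    return {d: correct[d] / total[d] for d in total}


if __name__ == "__main__":
    for domain, acc in sorted(per_domain_accuracy("predictions.jsonl").items()):
        print(f"{domain:>10}: {acc:.1%}")
```

Reporting accuracy per domain rather than a single aggregate score follows the benchmark's four-domain structure, making it visible when a model handles, say, flowcharts well but fails on geometry.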