Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
December 21, 2025
Authors: Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou
cs.AI
Abstract
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold-start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment: scaling up model size does not reliably improve alignment; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection: models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
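To make the notion of difficulty alignment concrete, here is a minimal sketch (not the paper's reported pipeline) of one common way to quantify it: rank-correlate model-estimated item difficulties with empirical human difficulties derived from student response data. All names and values below are illustrative assumptions.

```python
# Minimal sketch of measuring human-AI difficulty alignment.
# Assumption: difficulty is expressed on a common per-item scale,
# e.g., 1 - proportion of students answering the item correctly.
from scipy.stats import spearmanr

# Hypothetical empirical difficulties from student response data.
human_difficulty = [0.82, 0.35, 0.61, 0.90, 0.12]

# Hypothetical difficulties estimated by an LLM for the same items
# (e.g., by prompting it to rate each item, or to simulate a student
# at a given proficiency level and checking whether it fails).
llm_difficulty = [0.40, 0.55, 0.50, 0.45, 0.60]

# Rank correlation: a high rho means the model orders items by
# difficulty the way human learners experience them; a low rho
# indicates the misalignment the abstract describes.
rho, p_value = spearmanr(human_difficulty, llm_difficulty)
print(f"Human-AI difficulty alignment (Spearman rho) = {rho:.2f}, p = {p_value:.3f}")
```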