

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

December 21, 2025
作者: Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou
cs.AI

Abstract

Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold-start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment: scaling up model size does not reliably help; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
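To make the notion of Human-AI Difficulty Alignment concrete, the sketch below illustrates one common way such alignment can be quantified: rank correlation between a model's per-item difficulty ratings and the empirical difficulty observed from human test-takers. This is a minimal illustration under assumed inputs (toy per-item human accuracy values and hypothetical 1-5 model ratings), not the paper's actual evaluation protocol or data.

```python
# Minimal sketch: quantifying human-AI difficulty alignment as a Spearman
# rank correlation. The data below are hypothetical placeholders, not values
# from the paper.
from scipy.stats import spearmanr

# Hypothetical per-item human proportion-correct (higher = easier item)
# and a model's difficulty ratings on a 1 (easy) to 5 (hard) scale.
human_p_correct = [0.92, 0.74, 0.55, 0.38, 0.21]
model_difficulty = [1, 2, 2, 3, 5]

# Empirical difficulty is conventionally taken as 1 - p(correct), so a
# well-aligned model should assign higher ratings to lower-accuracy items.
human_difficulty = [1.0 - p for p in human_p_correct]

rho, p_value = spearmanr(human_difficulty, model_difficulty)
print(f"Human-AI difficulty alignment (Spearman rho): {rho:.3f} (p = {p_value:.3f})")
```

A proficiency-simulation setup, as described in the abstract, would replace the model ratings above with difficulty estimates elicited while the model is prompted to role-play a student at a stated ability level; the alignment metric itself stays the same.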