Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
February 23, 2026
Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
cs.AI
Abstract
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from authentic, repeatedly used university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically contain more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
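The step-efficiency observation above can be sketched as a simple per-problem comparison of reasoning-step counts. The function and data below are purely illustrative assumptions (not from the CFE codebase): a ratio above 1.0 means the model used more steps than the instructor's reference solution.

```python
# Hypothetical sketch of the step-efficiency diagnostic: compare the number
# of reasoning steps in a model's solution against the instructor's
# reference solution. All step counts below are illustrative, not CFE data.

def step_efficiency(model_steps: int, reference_steps: int) -> float:
    """Ratio of model reasoning steps to reference steps.

    A value > 1.0 means the model took more steps than the instructor's
    solution, suggesting lower step efficiency and more chances to err.
    """
    return model_steps / reference_steps

# Illustrative (model_steps, reference_steps) pairs for three problems.
solutions = [(7, 4), (5, 5), (9, 6)]
ratios = [step_efficiency(m, r) for m, r in solutions]
avg_ratio = sum(ratios) / len(ratios)
print(f"average step ratio: {avg_ratio:.2f}")
```

In this toy setup an average ratio above 1.0 would mirror the paper's finding that model solutions are typically longer than instructor solutions.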