Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Abstract

We introduce Classroom Final Exam, a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. The benchmark is curated from authentic university homework and exam problems that have been used repeatedly in courses, together with reference solutions provided by the course instructors. It presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically contain more reasoning steps than the instructor-provided solutions, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.

