Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Abstract

We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from authentic university homework and exam problems that have been used repeatedly in courses, together with reference solutions provided by the course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically contain more reasoning steps than the instructor's reference solutions, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
