

PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving

March 26, 2025
Authors: Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan
cs.AI

Abstract

We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.
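The abstract mentions a robust automated evaluation system but does not describe its internals. The sketch below shows one plausible way such a checker could work, using symbolic equivalence via SymPy; the function name, fallback logic, and tolerance are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of automated answer verification for a physics benchmark.
# Assumption: model answers and gold answers are compared by symbolic
# equivalence rather than exact string match (the paper's actual system
# is not specified in the abstract).
import sympy as sp

def answers_match(model_answer: str, gold_answer: str, tol: float = 1e-9) -> bool:
    """Return True if two answer expressions are mathematically equivalent."""
    try:
        model_expr = sp.sympify(model_answer)
        gold_expr = sp.sympify(gold_answer)
    except (sp.SympifyError, SyntaxError):
        # Fall back to normalized string comparison for unparsable answers.
        return model_answer.strip() == gold_answer.strip()
    diff = sp.simplify(model_expr - gold_expr)
    if diff == 0:
        return True
    # If symbols remain after simplification, the expressions differ.
    if diff.free_symbols:
        return False
    # Numeric fallback for constant expressions simplify() cannot reduce.
    return abs(complex(diff.evalf())) < tol

# Algebraically equivalent forms of the same result should match:
assert answers_match("1/2*m*v**2", "m*v**2/2")
assert not answers_match("m*g*h", "m*g*h/2")
```

Symbolic comparison like this tolerates superficial formatting differences in model outputs (factor ordering, unexpanded products) while still rejecting genuinely wrong answers, which is the kind of precision an automated grader for free-form physics solutions would need.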
