PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving
March 26, 2025
Authors: Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan
cs.AI
Abstract
We introduce PHYSICS, a comprehensive benchmark for university-level physics
problem solving. It contains 1297 expert-annotated problems covering six core
areas: classical mechanics, quantum mechanics, thermodynamics and statistical
mechanics, electromagnetism, atomic physics, and optics. Each problem requires
advanced physics knowledge and mathematical reasoning. We develop a robust
automated evaluation system for precise and reliable validation. Our evaluation
of leading foundation models reveals substantial limitations. Even the most
advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant
challenges in solving high-level scientific problems. Through comprehensive
error analysis, exploration of diverse prompting strategies, and
Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify
key areas for improvement, laying the foundation for future advancements.
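
The abstract mentions an automated evaluation system for validating model answers but does not describe its implementation. Below is a minimal sketch of one common approach to this problem, symbolic equivalence checking with sympy; the function name, parsing choices, and fallback behavior are illustrative assumptions, not the authors' actual pipeline.

```python
# A minimal sketch of automated answer validation via symbolic equivalence,
# assuming answers are short mathematical expressions. This is an illustrative
# approach, not the PHYSICS benchmark's actual evaluation system.
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(predicted: str, reference: str) -> bool:
    """Return True if two answer strings are symbolically equal."""
    try:
        pred = parse_expr(predicted)
        ref = parse_expr(reference)
        # simplify(pred - ref) == 0 catches algebraically equal forms
        # that differ textually, e.g. "2*x + x" vs "3*x".
        return sympy.simplify(pred - ref) == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Fall back to exact string comparison for non-symbolic answers.
        return predicted.strip() == reference.strip()

# Equivalent expressions written differently are accepted;
# genuinely different answers are rejected.
assert answers_equivalent("2*x + x", "3*x")
assert not answers_equivalent("x**2", "2*x")
```

Symbolic comparison of this kind is one way to make grading robust to surface-form variation in free-response mathematical answers, which plain string matching would mark wrong.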