SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
July 20, 2023
Authors: Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang
cs.AI
Abstract
Recent advances in large language models (LLMs) have demonstrated notable
progress on many mathematical benchmarks. However, most of these benchmarks
only feature problems grounded in junior and senior high school subjects,
contain only multiple-choice questions, and are confined to a limited scope of
elementary arithmetic operations. To address these issues, this paper
introduces an expansive benchmark suite SciBench that aims to systematically
examine the reasoning capabilities required for complex scientific problem
solving. SciBench contains two carefully curated datasets: an open set
featuring a range of collegiate-level scientific problems drawn from
mathematics, chemistry, and physics textbooks, and a closed set comprising
problems from undergraduate-level exams in computer science and mathematics.
Based on the two datasets, we conduct an in-depth benchmark study of two
representative LLMs with various prompting strategies. The results reveal that
current LLMs fall short of delivering satisfactory performance, with an overall
score of merely 35.80%. Furthermore, through a detailed user study, we
categorize the errors made by LLMs into ten problem-solving abilities. Our
analysis indicates that no single prompting strategy significantly outperforms
the others, and that strategies which improve certain problem-solving skills
can cause declines in others. We envision that
SciBench will catalyze further developments in the reasoning abilities of LLMs,
thereby ultimately contributing to scientific research and discovery.
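The abstract reports an overall score of 35.80% across the benchmarked prompting strategies. A minimal sketch of how such an accuracy score might be computed for free-form numeric answers, assuming each model prediction is compared to the textbook gold answer within a relative tolerance (the function names and the 5% tolerance are illustrative assumptions, not details taken from the paper):

```python
# Hypothetical scoring sketch: compare a model's final numeric answer to the
# gold answer within a relative tolerance, then report overall accuracy.
# The 5% tolerance is an illustrative assumption, not the paper's setting.

def is_correct(predicted: float, gold: float, rel_tol: float = 0.05) -> bool:
    """Accept a prediction whose relative error w.r.t. the gold answer is small."""
    if gold == 0.0:
        # Fall back to an absolute check when the gold answer is zero.
        return abs(predicted) <= rel_tol
    return abs(predicted - gold) / abs(gold) <= rel_tol

def accuracy(pairs) -> float:
    """Overall score: fraction of (predicted, gold) pairs judged correct."""
    if not pairs:
        return 0.0
    return sum(is_correct(p, g) for p, g in pairs) / len(pairs)

# Example: three of the four answers fall within 5% of the gold value.
print(accuracy([(9.8, 9.81), (3.0, 3.1), (100.0, 250.0), (0.52, 0.5)]))  # 0.75
```

Tolerance-based matching matters here because, unlike the multiple-choice benchmarks the paper criticizes, open-ended scientific problems yield numeric answers that rarely match the reference to full precision.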