SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Abstract

Recent advances in large language models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks feature only problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces an expansive benchmark suite, SciBench, that aims to systematically examine the reasoning capabilities required for complex scientific problem solving. SciBench contains two carefully curated datasets: an open set featuring a range of collegiate-level scientific problems drawn from mathematics, chemistry, and physics textbooks, and a closed set comprising problems from undergraduate-level exams in computer science and mathematics. Based on the two datasets, we conduct an in-depth benchmark study of two representative LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with an overall score of merely 35.80%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others, and that some strategies that improve certain problem-solving skills cause declines in others. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby contributing to scientific research and discovery.
