SciCode: A Research Coding Benchmark Curated by Scientists

Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' ability to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
