SciCode: A Research Coding Benchmark Curated by Scientists

Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
