SciCode: A Research Coding Benchmark Curated by Scientists

Abstract

Language models (LMs) now outperform average humans on many challenging tasks, which makes it increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' ability to generate code that solves real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions that specify useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the future development and evaluation of scientific AI.
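The decomposition of 80 main problems into 338 subproblems (about 4.2 subproblems per main problem on average) suggests two natural granularities for scoring. The abstract does not spell out the scoring rule, so the following is only a minimal sketch under stated assumptions: the `MainProblem` record, both function names, and the rule that a main problem counts as solved only when every one of its subproblems passes are all hypothetical illustrations, not the benchmark's actual evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MainProblem:
    """Hypothetical record: one main problem and the pass/fail result of each subproblem."""
    name: str
    subproblem_passed: List[bool]


def subproblem_solve_rate(problems: List[MainProblem]) -> float:
    """Fraction of all subproblems whose test cases passed."""
    total = sum(len(p.subproblem_passed) for p in problems)
    solved = sum(sum(p.subproblem_passed) for p in problems)
    return solved / total if total else 0.0


def main_problem_solve_rate(problems: List[MainProblem]) -> float:
    """Fraction of main problems solved.

    Assumption (not stated in the abstract): a main problem counts as
    solved only if it has at least one subproblem and all of them pass.
    """
    if not problems:
        return 0.0
    solved = sum(
        1 for p in problems if p.subproblem_passed and all(p.subproblem_passed)
    )
    return solved / len(problems)


# Made-up results for three illustrative main problems.
results = [
    MainProblem("problem_a", [True, True, True]),
    MainProblem("problem_b", [True, False]),
    MainProblem("problem_c", [False, False, True, True]),
]

print(f"subproblem solve rate:   {subproblem_solve_rate(results):.3f}")
print(f"main problem solve rate: {main_problem_solve_rate(results):.3f}")
```

Under this assumed rule, the main-problem rate can be far lower than the subproblem rate, because a single failing subproblem marks the whole main problem as unsolved.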
