SciCode: A Research Coding Benchmark Curated by Scientists
July 18, 2024
Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
cs.AI
Abstract
Since language models (LMs) now outperform average humans on many challenging
tasks, it has become increasingly difficult to develop challenging,
high-quality, and realistic evaluations. We address this issue by examining
LMs' capabilities to generate code for solving real scientific research
problems. Incorporating input from scientists and AI researchers in 16 diverse
natural science sub-fields, including mathematics, physics, chemistry, biology,
and materials science, we created a scientist-curated coding benchmark,
SciCode. The problems in SciCode naturally factorize into multiple subproblems,
each involving knowledge recall, reasoning, and code synthesis. In total,
SciCode contains 338 subproblems decomposed from 80 challenging main problems.
It offers optional descriptions specifying useful scientific background
information and scientist-annotated gold-standard solutions and test cases for
evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can
solve only 4.6% of the problems in the most realistic setting. We believe that
SciCode both demonstrates contemporary LMs' progress towards becoming helpful
scientific assistants and sheds light on the development and evaluation of
scientific AI in the future.
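
The decomposition described in the abstract (main problems split into subproblems, each checked against test cases) can be pictured with a minimal Python sketch. Everything in it is illustrative: the class names, the field names, and the toy data are hypothetical and are not SciCode's actual schema or evaluation code. It also assumes that a main problem counts as solved only when all of its subproblems pass, which the abstract does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Subproblem:
    # Hypothetical fields; illustrative only, not SciCode's schema.
    description: str
    passed: bool = False  # True if generated code passed this subproblem's tests


@dataclass
class MainProblem:
    name: str
    subproblems: list[Subproblem] = field(default_factory=list)

    def solved(self) -> bool:
        # Assumption: a main problem is solved only if every subproblem passes.
        return bool(self.subproblems) and all(s.passed for s in self.subproblems)


def solve_rates(problems: list[MainProblem]) -> tuple[float, float]:
    """Return (main-problem solve rate, subproblem solve rate) as fractions.

    Expects a non-empty list in which at least one problem has subproblems.
    """
    subs = [s for p in problems for s in p.subproblems]
    main_rate = sum(p.solved() for p in problems) / len(problems)
    sub_rate = sum(s.passed for s in subs) / len(subs)
    return main_rate, sub_rate


# Toy data: problem A passes both steps, problem B fails its second step.
problems = [
    MainProblem("A", [Subproblem("step 1", True), Subproblem("step 2", True)]),
    MainProblem("B", [Subproblem("step 1", True), Subproblem("step 2", False)]),
]
main_rate, sub_rate = solve_rates(problems)
print(main_rate, sub_rate)
```

Under this all-or-nothing assumption, the main-problem rate can sit well below the subproblem rate, since a single failed step fails the whole problem.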