SciCode: A Research Coding Benchmark Curated by Scientists

Abstract

Language models (LMs) now outperform average humans on many challenging tasks, so it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' ability to generate code that solves real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science subfields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information, along with scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
