LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few, if any, are designed to evaluate language model performance on the practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple-choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall of and reasoning over literature, interpretation of figures, access to and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure the performance of several models on our benchmark and report the results in comparison with human expert biology researchers. We will continue to update and expand LAB-Bench over time, and we expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
