LAB-Bench: Measuring Capabilities of Language Models for Biology Research
July 14, 2024
Authors: Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, Samuel G. Rodriques
cs.AI
Abstract
There is widespread optimism that frontier Large Language Models (LLMs) and
LLM-augmented systems have the potential to rapidly accelerate scientific
discovery across disciplines. Today, many benchmarks exist to measure LLM
knowledge and reasoning on textbook-style science questions, but few if any
benchmarks are designed to evaluate language model performance on practical
tasks required for scientific research, such as literature search, protocol
planning, and data analysis. As a step toward building such benchmarks, we
introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of
over 2,400 multiple choice questions for evaluating AI systems on a range of
practical biology research capabilities, including recall and reasoning over
literature, interpretation of figures, access and navigation of databases, and
comprehension and manipulation of DNA and protein sequences. Importantly, in
contrast to previous scientific benchmarks, we expect that an AI system that
can achieve consistently high scores on the more difficult LAB-Bench tasks
would serve as a useful assistant for researchers in areas such as literature
search and molecular cloning. As an initial assessment of the emergent
scientific task capabilities of frontier language models, we measure
the performance of several such models against our benchmark and report
results alongside those of human expert biology researchers. We will continue
to update and expand
LAB-Bench over time, and expect it to serve as a useful tool in the development
of automated research systems going forward. A public subset of LAB-Bench is
available for use at the following URL:
https://huggingface.co/datasets/futurehouse/lab-bench
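To explore the public subset, it can be loaded with the Hugging Face datasets library. The sketch below is a minimal example under stated assumptions: the task configuration name "LitQA2" and the record field names are illustrative guesses rather than details specified in this abstract; consult the dataset card at the URL above for the actual configurations and schema.

from datasets import load_dataset

# Minimal sketch: load one LAB-Bench task from the public Hugging Face subset.
# Assumes the `datasets` library is installed (pip install datasets).
# "LitQA2" and the field names below are assumptions for illustration;
# see the dataset card for the actual configurations and schema.
litqa = load_dataset("futurehouse/lab-bench", "LitQA2", split="train")

example = litqa[0]
print(example["question"])     # question stem (assumed field name)
print(example["ideal"])        # correct answer choice (assumed field name)
print(example["distractors"])  # incorrect choices (assumed field name)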