LAB-Bench: Measuring Capabilities of Language Models for Biology Research
July 14, 2024
Authors: Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, Samuel G. Rodriques
cs.AI
Abstract
There is widespread optimism that frontier Large Language Models (LLMs) and
LLM-augmented systems have the potential to rapidly accelerate scientific
discovery across disciplines. Today, many benchmarks exist to measure LLM
knowledge and reasoning on textbook-style science questions, but few if any
benchmarks are designed to evaluate language model performance on practical
tasks required for scientific research, such as literature search, protocol
planning, and data analysis. As a step toward building such benchmarks, we
introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of
over 2,400 multiple choice questions for evaluating AI systems on a range of
practical biology research capabilities, including recall and reasoning over
literature, interpretation of figures, access and navigation of databases, and
comprehension and manipulation of DNA and protein sequences. Importantly, in
contrast to previous scientific benchmarks, we expect that an AI system that
can achieve consistently high scores on the more difficult LAB-Bench tasks
would serve as a useful assistant for researchers in areas such as literature
search and molecular cloning. As an initial assessment of the emergent
scientific task capabilities of frontier language models, we measure the
performance of several models against our benchmark and report results compared to
human expert biology researchers. We will continue to update and expand
LAB-Bench over time, and expect it to serve as a useful tool in the development
of automated research systems going forward. A public subset of LAB-Bench is
available for use at the following URL:
https://huggingface.co/datasets/futurehouse/lab-bench
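
For readers who want to explore the public subset, below is a minimal sketch of loading it with the Hugging Face datasets library. The configuration name "LitQA2", the "train" split, and the record fields ("question", "ideal", "distractors") are assumptions about the dataset layout, not details stated in the abstract; consult the dataset card at the URL above for the actual schema.

    from datasets import load_dataset

    # Load one task configuration from the public LAB-Bench subset.
    # "LitQA2" and the "train" split are assumed names; check the
    # dataset card for the configurations actually available.
    ds = load_dataset("futurehouse/lab-bench", "LitQA2", split="train")

    # Each record is assumed to hold the question text, the correct
    # answer ("ideal"), and a list of incorrect options ("distractors").
    example = ds[0]
    print(example["question"])
    print("Correct answer:", example["ideal"])
    print("Distractors:", example["distractors"])

    # Expected accuracy of uniform random guessing on this question:
    # one correct answer among (1 + number of distractors) options.
    n_options = 1 + len(example["distractors"])
    print(f"Random-guess baseline: {1 / n_options:.2%}")

Comparing a model's multiple-choice accuracy against this random-guess baseline, and against the human expert scores the paper reports, gives a quick sense of whether the model is doing better than chance on a given task.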