LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few, if any, are designed to evaluate language model performance on the practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple-choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall of and reasoning over literature, interpretation of figures, access to and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure the performance of several models on our benchmark and report the results in comparison with those of human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
