ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Abstract

Advances in large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about the true capabilities of such agents. In this work, we argue that for an agent to fully automate scientific discovery, it must be able to complete all essential tasks in the workflow. Thus, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims about end-to-end automation. To this end, we present ScienceAgentBench, a new benchmark for evaluating language agents for data-driven scientific discovery. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task into a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. These results underscore the limited capacities of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
