AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

Abstract

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
