AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
July 17, 2025
Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
cs.AI
Abstract
We introduce AbGen, the first benchmark designed to evaluate the capabilities
of LLMs in designing ablation studies for scientific research. AbGen consists
of 1,500 expert-annotated examples derived from 807 NLP papers. In this
benchmark, LLMs are tasked with generating detailed ablation study designs for
a specified module or process based on the given research context. Our
evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a
significant performance gap between these models and human experts in terms of
the importance, faithfulness, and soundness of the ablation study designs.
Moreover, we demonstrate that current automated evaluation methods are not
reliable for our task, as they show a significant discrepancy when compared to
human assessment. To better investigate this, we develop AbGen-Eval, a
meta-evaluation benchmark designed to assess the reliability of commonly used
automated evaluation systems in measuring LLM performance on our task. We
investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for
future research on developing more effective and reliable LLM-based evaluation
systems for complex scientific tasks.
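Below is a minimal sketch, not taken from the paper, of how an LLM-as-Judge system of the kind studied on AbGen-Eval might score a generated ablation-study design on the three criteria named in the abstract (importance, faithfulness, and soundness). The prompt wording, the 1-5 scale, and the `call_llm` helper are illustrative assumptions; substitute an actual model client.

```python
# Hypothetical LLM-as-Judge scoring sketch (not the authors' implementation).
from dataclasses import dataclass

# Assumed phrasing of the three criteria from the abstract.
CRITERIA = {
    "importance": "Does the ablation target a component whose removal would meaningfully inform the research question?",
    "faithfulness": "Is the design consistent with the method described in the research context?",
    "soundness": "Is the experimental procedure well specified and technically valid?",
}

@dataclass
class JudgeResult:
    criterion: str
    score: int       # assumed 1-5 scale
    rationale: str

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call; hypothetical."""
    raise NotImplementedError

def judge_design(research_context: str, target_module: str, design: str) -> list[JudgeResult]:
    """Ask the judge model to rate one ablation-study design per criterion."""
    results = []
    for name, question in CRITERIA.items():
        prompt = (
            "You are reviewing an ablation-study design.\n"
            f"Research context:\n{research_context}\n\n"
            f"Ablated module/process: {target_module}\n\n"
            f"Proposed design:\n{design}\n\n"
            f"Criterion ({name}): {question}\n"
            "Reply as 'SCORE: <1-5>' on the first line and "
            "'RATIONALE: <one sentence>' on the second."
        )
        reply = call_llm(prompt).splitlines()
        score = int(reply[0].split(":", 1)[1].strip())
        rationale = reply[1].split(":", 1)[1].strip() if len(reply) > 1 else ""
        results.append(JudgeResult(name, score, rationale))
    return results
```

In a meta-evaluation setting like AbGen-Eval, scores produced this way would then be compared against human expert ratings (e.g., via correlation) to measure how reliable the automated judge is.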