AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
July 17, 2025
Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
cs.AI
Abstract
We introduce AbGen, the first benchmark designed to evaluate the capabilities
of LLMs in designing ablation studies for scientific research. AbGen consists
of 1,500 expert-annotated examples derived from 807 NLP papers. In this
benchmark, LLMs are tasked with generating detailed ablation study designs for
a specified module or process based on the given research context. Our
evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a
significant performance gap between these models and human experts in terms of
the importance, faithfulness, and soundness of the ablation study designs.
Moreover, we demonstrate that current automated evaluation methods are not
reliable for our task, as they show a significant discrepancy when compared to
human assessment. To better investigate this, we develop AbGen-Eval, a
meta-evaluation benchmark designed to assess the reliability of commonly used
automated evaluation systems in measuring LLM performance on our task. We
investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for
future research on developing more effective and reliable LLM-based evaluation
systems for complex scientific tasks.
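Below is a minimal sketch, not taken from the paper, of how an LLM-as-Judge system of the kind studied on AbGen-Eval might score a generated ablation-study design on the three criteria named in the abstract (importance, faithfulness, and soundness). The prompt wording, the 1-5 scale, and the `call_llm` helper are illustrative assumptions; substitute an actual model client.

```python
# Hypothetical LLM-as-Judge scoring sketch (not the authors' implementation).
from dataclasses import dataclass

# Assumed phrasing of the three criteria from the abstract.
CRITERIA = {
    "importance": "Does the ablation target a component whose removal would meaningfully inform the research question?",
    "faithfulness": "Is the design consistent with the method described in the research context?",
    "soundness": "Is the experimental procedure well specified and technically valid?",
}

@dataclass
class JudgeResult:
    criterion: str
    score: int       # assumed 1-5 scale
    rationale: str

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call; hypothetical."""
    raise NotImplementedError

def judge_design(research_context: str, target_module: str, design: str) -> list[JudgeResult]:
    """Ask the judge model to rate one ablation-study design per criterion."""
    results = []
    for name, question in CRITERIA.items():
        prompt = (
            "You are reviewing an ablation-study design.\n"
            f"Research context:\n{research_context}\n\n"
            f"Ablated module/process: {target_module}\n\n"
            f"Proposed design:\n{design}\n\n"
            f"Criterion ({name}): {question}\n"
            "Reply as 'SCORE: <1-5>' on the first line and "
            "'RATIONALE: <one sentence>' on the second."
        )
        reply = call_llm(prompt).splitlines()
        score = int(reply[0].split(":", 1)[1].strip())
        rationale = reply[1].split(":", 1)[1].strip() if len(reply) > 1 else ""
        results.append(JudgeResult(name, score, rationale))
    return results
```

In a meta-evaluation setting like AbGen-Eval, scores produced this way would then be compared against human expert ratings (e.g., via correlation) to measure how reliable the automated judge is.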