AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

July 17, 2025
Authors: Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
cs.AI

Abstract

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
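
To make the evaluation setup more concrete, below is a minimal Python sketch of an LLM-as-Judge scoring loop over AbGen-style examples. The three criteria (importance, faithfulness, soundness) come from the abstract; the data fields, prompt wording, 1-5 scale, and `judge_fn` interface are illustrative assumptions, not the authors' released evaluation code.

```python
# Hypothetical sketch of an LLM-as-Judge scoring loop for AbGen-style outputs.
# Criterion names follow the abstract; field names, prompt text, and the
# judge_fn interface are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Dict, List

CRITERIA = ("importance", "faithfulness", "soundness")


@dataclass
class AbGenExample:
    research_context: str   # excerpt of the source NLP paper
    target_module: str      # module or process to be ablated
    generated_design: str   # ablation study design produced by the evaluated LLM


def judge_prompt(ex: AbGenExample, criterion: str) -> str:
    """Build a single-criterion rating prompt for the judge model."""
    return (
        f"Research context:\n{ex.research_context}\n\n"
        f"Module to ablate: {ex.target_module}\n\n"
        f"Proposed ablation study design:\n{ex.generated_design}\n\n"
        f"Rate the design's {criterion} on a 1-5 scale. Reply with one integer."
    )


def score_examples(
    examples: List[AbGenExample],
    judge_fn: Callable[[str], str],  # wraps a call to the judge LLM
) -> Dict[str, float]:
    """Average 1-5 judge scores per criterion over a list of examples."""
    totals = {c: 0.0 for c in CRITERIA}
    for ex in examples:
        for c in CRITERIA:
            reply = judge_fn(judge_prompt(ex, c))
            totals[c] += float(reply.strip())
    return {c: totals[c] / len(examples) for c in CRITERIA}


if __name__ == "__main__":
    demo = [AbGenExample("Toy context.", "attention module",
                         "Remove the attention module and retrain.")]
    # Stub judge that always answers "3"; swap in a real LLM call in practice.
    print(score_examples(demo, judge_fn=lambda prompt: "3"))
```

In practice, `judge_fn` would wrap a call to a strong LLM; AbGen-Eval is the meta-evaluation benchmark used to test how well such automated scores agree with human assessment.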