Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
July 3, 2025
Authors: Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
cs.AI
Abstract
Peer review is fundamental to scientific research, but the growing volume of
publications has intensified the challenges of this expertise-intensive
process. While LLMs show promise in various scientific tasks, their potential
to assist with peer review, particularly in identifying paper limitations,
remains understudied. We first present a comprehensive taxonomy of limitation
types in scientific research, with a focus on AI. Guided by this taxonomy, we
present LimitGen, the first comprehensive benchmark
for evaluating LLMs' capability to support early-stage feedback and complement
human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a
synthetic dataset carefully created through controlled perturbations of
high-quality papers, and LimitGen-Human, a collection of real human-written
limitations. To improve the ability of LLM systems to identify limitations, we
augment them with literature retrieval, which is essential for grounding the
identified limitations in prior scientific findings. Our approach enhances the
capability of LLM systems to generate limitations for research papers,
enabling them to provide more concrete and constructive feedback.
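
The abstract does not describe how the controlled perturbations behind LimitGen-Syn are implemented. The following is a minimal sketch of what such a construction could look like: each perturbation injects one known limitation type into a high-quality paper and records that type as the gold label. All names here (`Paper`, `drop_ablation`, `PERTURBATIONS`, `make_synthetic_example`) are hypothetical illustrations, not the paper's actual pipeline.

```python
# Hypothetical sketch of a LimitGen-Syn-style example constructor: perturb a
# high-quality paper so that it exhibits exactly one known limitation type.
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Paper:
    title: str
    sections: dict[str, str]  # section name -> section text


def drop_ablation(paper: Paper) -> Paper:
    """Inject a 'missing ablation study' limitation by deleting that section."""
    kept = {name: text for name, text in paper.sections.items() if name != "ablation"}
    return replace(paper, sections=kept)


def shrink_evaluation(paper: Paper) -> Paper:
    """Inject a 'limited evaluation scope' limitation by keeping one result only."""
    experiments = paper.sections.get("experiments", "")
    first_claim = experiments.split(". ")[0]
    return replace(paper, sections={**paper.sections, "experiments": first_claim + "."})


# Each perturbation is paired with the limitation type it introduces,
# which becomes the gold label for the benchmark instance.
PERTURBATIONS = {
    "missing_ablation": drop_ablation,
    "limited_evaluation": shrink_evaluation,
}


def make_synthetic_example(paper: Paper, rng: random.Random) -> tuple[Paper, str]:
    """Return a perturbed paper together with its gold limitation label."""
    label = rng.choice(sorted(PERTURBATIONS))
    return PERTURBATIONS[label](paper), label
```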
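Likewise, the literature-retrieval augmentation can be pictured as a simple retrieve-then-prompt loop: fetch related prior work, then ask the model for limitations that cite those findings. The sketch below stubs out the retrieval backend and the LLM call (`search_related_work`, `llm_complete` are placeholders, not the paper's actual system).

```python
# Hypothetical sketch of retrieval-augmented limitation identification.
def search_related_work(query: str, top_k: int) -> list[str]:
    """Placeholder for a literature-search backend (e.g., an abstract index)."""
    return [f"Related finding {i + 1} for: {query[:40]}..." for i in range(top_k)]


def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM client call."""
    return f"(model output for a {len(prompt)}-character prompt)"


def identify_limitations(paper_text: str, k: int = 5) -> str:
    """Retrieve k related abstracts, then ask the model for grounded limitations."""
    related = search_related_work(query=paper_text, top_k=k)
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(related))
    prompt = (
        "You are assisting with early-stage peer review.\n\n"
        f"Related prior work:\n{context}\n\n"
        f"Paper under review:\n{paper_text}\n\n"
        "List the paper's most critical limitations, citing the related work "
        "above wherever a limitation is grounded in prior findings."
    )
    return llm_complete(prompt)
```

Grounding the prompt in retrieved abstracts is what lets the model tie a claimed limitation to concrete prior findings rather than generic review boilerplate, which is the stated motivation for the retrieval step.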