

Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

July 3, 2025
作者: Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
cs.AI

Abstract

Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identified limitations in prior scientific findings. Our approach enhances the capability of LLM systems to generate limitations for research papers, enabling them to provide more concrete and constructive feedback.