Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

Abstract

Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While large language models (LLMs) show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, we present LimitGen, the first comprehensive benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review. The benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identified limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

