On the Limitations and Opportunities of AI Reviewers: Assessing Review Comments from Nature-Family Journals with 45 Expert Scientists

Abstract

As AI capabilities advance, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists view them simply as probabilistic systems that lack the expertise to evaluate research, while other researchers are optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused mainly on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study in which 45 domain scientists in the Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) drawn from human-written and AI-generated reviews of 82 Nature-family papers. Each criticism was rated on three dimensions: correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human reviewer on every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and they surface a distinct 26% of issues that no human raises. However, AI reviewers overlap with one another far more than humans do (21% vs. 3% for cross-reviewer pairs), and they exhibit 16 recurring weaknesses that humans do not share, such as limited subfield knowledge, poor long-context management across multiple files, and an overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.