On the Limits and Opportunities of AI Reviewers: Examining Reviews of Nature-Family Papers with 45 Expert Scientists

Abstract

As AI capabilities advance, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human reviewer on every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap with one another far more than human reviewers do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses that humans do not share, such as limited subfield knowledge, a lack of long-context management across multiple files, and an overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.