Limits and Opportunities of AI Reviewers: An Evaluation of Nature-Family Journal Reviews by 45 Expert Scientists

Abstract

As AI capabilities advance, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists view them as probabilistic systems without the expertise to evaluate research, while other researchers are optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human reviewer on every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and they surface a distinct 26% of issues that no human raises. However, AI reviewers overlap with one another far more than humans do (21% vs. 3% for cross-reviewer pairs), and they exhibit 16 recurring weaknesses that humans do not share, such as limited subfield knowledge, poor long-context management across multiple files, and an overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.
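The abstract reports two kinds of statistics without defining them: a composite over the three rating dimensions and an overlap rate between reviewer pairs. The sketch below shows one plausible way such quantities could be computed. It assumes that a criticism counts toward the composite only when it is rated positively on all three dimensions, and that overlap is the mean Jaccard similarity of the issue sets raised by each pair of reviewers. Both definitions, along with the function names, field names, and toy data, are illustrative assumptions rather than the authors' method.

```python
from itertools import combinations


def composite_rate(ratings):
    """Fraction of criticisms rated positively on all three dimensions.

    Assumed definition: the abstract does not specify how the composite is built.
    """
    if not ratings:
        return 0.0
    passed = sum(
        1 for r in ratings
        if r["correct"] and r["significant"] and r["evidenced"]
    )
    return passed / len(ratings)


def mean_pairwise_overlap(issues_by_reviewer):
    """Mean Jaccard overlap of issue sets over all reviewer pairs.

    Assumed definition: the abstract reports overlap for cross-reviewer
    pairs but does not state the exact measure.
    """
    pairs = list(combinations(issues_by_reviewer.values(), 2))
    if not pairs:
        return 0.0
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, not taken from the study.
toy_ratings = [
    {"correct": True, "significant": True, "evidenced": True},
    {"correct": True, "significant": False, "evidenced": True},
    {"correct": False, "significant": False, "evidenced": False},
    {"correct": True, "significant": True, "evidenced": True},
]
toy_issues = {
    "reviewer_1": {"issue_a", "issue_b"},
    "reviewer_2": {"issue_b", "issue_c"},
    "reviewer_3": {"issue_d"},
}

toy_composite = composite_rate(toy_ratings)
toy_overlap = mean_pairwise_overlap(toy_issues)
```

Under these assumptions, a high composite rate together with a high pairwise overlap would describe reviewers that are individually strong but raise similar points, which is the pattern the abstract attributes to AI reviewers.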