ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review

October 9, 2025
Authors: Gaurav Sahu, Hugo Larochelle, Laurent Charlin, Christopher Pal
cs.AI

Abstract

Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieve 81.8% accuracy on the task of categorizing a paper as accept/reject, compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though they still trail the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.