ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review
October 9, 2025
Authors: Gaurav Sahu, Hugo Larochelle, Laurent Charlin, Christopher Pal
cs.AI
Abstract
Peer review is the cornerstone of scientific publishing, yet it suffers from
inconsistencies, reviewer subjectivity, and scalability challenges. We
introduce ReviewerToo, a modular framework for studying and deploying
AI-assisted peer review to complement human judgment with systematic and
consistent assessments. ReviewerToo supports systematic experiments with
specialized reviewer personas and structured evaluation criteria, and can be
partially or fully integrated into real conference workflows. We validate
ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR
2025, where our experiments with the gpt-oss-120b model achieve 81.8% accuracy
on the task of categorizing papers as accept/reject, compared to 83.9% for the
average human reviewer. Additionally, ReviewerToo-generated reviews are rated
as higher quality than the human average by an LLM judge, though still trailing
the strongest expert contributions. Our analysis highlights domains where AI
reviewers excel (e.g., fact-checking, literature coverage) and where they
struggle (e.g., assessing methodological novelty and theoretical
contributions), underscoring the continued need for human expertise. Based on
these findings, we propose guidelines for integrating AI into peer-review
pipelines, showing how AI can enhance consistency, coverage, and fairness while
leaving complex evaluative judgments to domain experts. Our work provides a
foundation for systematic, hybrid peer-review systems that scale with the
growth of scientific publishing.
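
The abstract describes ReviewerToo as pairing specialized reviewer personas with structured evaluation criteria, then scoring the resulting accept/reject calls against the venue's decisions. Below is a minimal, hypothetical sketch of that loop under stated assumptions: `call_llm` is a placeholder for any chat-completion backend (e.g., a locally served gpt-oss-120b), and the persona prompt, criterion names, and output format are illustrative inventions, not the paper's actual implementation.

```python
import re
from dataclasses import dataclass

# Assumed criterion names for illustration; the paper's actual rubric may differ.
CRITERIA = ["soundness", "novelty", "clarity", "literature_coverage"]

@dataclass
class Review:
    persona: str
    scores: dict[str, int]   # criterion -> parsed 1-10 score
    decision: str            # "accept" or "reject"

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM backend and return its text reply.
    NOT the ReviewerToo API; wire in any chat-completion client here."""
    raise NotImplementedError

def parse_scores(raw: str) -> dict[str, int]:
    """Pull 'criterion: N' lines out of the model's free-text response."""
    scores = {}
    for crit in CRITERIA:
        m = re.search(rf"{crit}\s*:\s*(\d+)", raw, re.IGNORECASE)
        scores[crit] = int(m.group(1)) if m else 0
    return scores

def review_paper(paper_text: str, persona: str) -> Review:
    # Persona plus structured criteria in one prompt; the exact wording is assumed.
    prompt = (
        f"You are a peer reviewer with this persona: {persona}.\n"
        f"Rate the paper 1-10 on each of: {', '.join(CRITERIA)}.\n"
        "Write one 'criterion: score' line per criterion, then end with\n"
        "'DECISION: accept' or 'DECISION: reject'.\n\n"
        f"PAPER:\n{paper_text}"
    )
    raw = call_llm(prompt)
    accepted = re.search(r"DECISION:\s*accept", raw, re.IGNORECASE) is not None
    return Review(persona, parse_scores(raw), "accept" if accepted else "reject")

def decision_accuracy(ai: list[str], human: list[str]) -> float:
    """Share of papers where the AI accept/reject call matches the venue's
    final decision -- the metric behind the 81.8% vs 83.9% comparison."""
    return sum(a == h for a, h in zip(ai, human)) / len(human)
```

Keeping the persona and rubric in the prompt, and the decision on a fixed final line, makes runs comparable across personas and easy to score in bulk, which is the kind of systematic, consistent assessment the abstract emphasizes.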