PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

Abstract

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how well these systems actually perform, especially compared with human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Most existing evaluations rely either on surface-level metrics such as ROUGE and BLEU or on unconstrained LLM-as-a-judge prompting that conflates fluency with rigor. PRISM instead grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or exceed human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each system exhibits a distinct specialization profile with characteristic blind spots, which are failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review: effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.
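To illustrate why a single aggregate score can hide a dimension-specific blind spot, the following minimal sketch compares a per-dimension profile against a baseline. It is not the PRISM scoring pipeline: the dimension keys, the `aggregate` and `blind_spots` helpers, the `tolerance` parameter, and all scores are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Short keys standing in for the four PRISM dimensions (names are assumptions).
DIMENSIONS = ("depth", "novelty", "flaws", "constructiveness")


def aggregate(profile):
    """Collapse a per-dimension profile into one unweighted mean score."""
    return mean(profile[d] for d in DIMENSIONS)


def blind_spots(profile, baseline, tolerance=0.05):
    """Return the dimensions where `profile` trails `baseline` by more than `tolerance`."""
    return [d for d in DIMENSIONS if profile[d] < baseline[d] - tolerance]


# Made-up scores in [0, 1], for illustration only; these are NOT results from the paper.
human = {"depth": 0.70, "novelty": 0.70, "flaws": 0.70, "constructiveness": 0.70}
system = {"depth": 0.70, "novelty": 0.90, "flaws": 0.85, "constructiveness": 0.35}

# The two aggregates are (numerically) the same, yet the profiles differ sharply.
same_aggregate = abs(aggregate(system) - aggregate(human)) < 1e-9
weak_dimensions = blind_spots(system, human)
```

With these hypothetical numbers, `same_aggregate` is true while `weak_dimensions` contains only `"constructiveness"`, which is the kind of gap that a per-dimension comparison exposes and an averaged score conceals.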