PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

Abstract

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how well these systems actually perform, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations, which rely on surface-level metrics such as ROUGE and BLEU or on unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or exceed human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each system exhibits a distinct specialization profile with characteristic blind spots, failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review: effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.