PRISM: A Multi-dimensional Benchmark for Evaluating LLM Reviewers

Abstract

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in automated peer reviewers based on large language models (LLMs). However, how good these systems actually are, especially compared with human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Most existing evaluations rely either on surface-level metrics such as ROUGE and BLEU or on unconstrained LLM-as-a-judge prompting that conflates fluency with rigor. PRISM instead grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results show that LLMs can match or exceed human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each system exhibits a distinct specialization profile with characteristic blind spots, failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review: effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results are available at https://khanhthanhdev.github.io/prism-page/.