PRISM: A Multi-dimensional Benchmark for Evaluating LLM Peer Reviewers

Abstract

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared with human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations, which rely on surface-level metrics such as ROUGE and BLEU or on unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results show that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each system exhibits a distinct specialization profile with characteristic blind spots, failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review: effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.