

ReviewScore: Misinformed Peer Review Detection with Large Language Models

September 25, 2025
Authors: Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
cs.AI

Abstract

Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that rest on incorrect premises, or "questions" in a review that are already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
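The abstract describes a two-step procedure: reconstruct the explicit and implicit premises behind a stated weakness, then judge each premise against the paper. Below is a minimal sketch of that idea, not the authors' actual engine; the `llm` callable, prompts, and function names are illustrative assumptions standing in for any chat-model call.

```python
# Minimal sketch (assumptions, not the paper's implementation) of premise-level
# misinformed-weakness detection: reconstruct premises from a review weakness
# with an LLM, judge each premise against the paper text, and flag the weakness
# as misinformed if any premise is judged factually incorrect.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # stand-in for any chat-model call (hypothetical interface)


@dataclass
class PremiseVerdict:
    premise: str
    is_factual: bool


def reconstruct_premises(weakness: str, llm: LLM) -> List[str]:
    """Ask the model to list every explicit and implicit premise of the weakness."""
    prompt = (
        "List every explicit and implicit premise underlying this review weakness, "
        "one per line:\n" + weakness
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]


def judge_premise(premise: str, paper_text: str, llm: LLM) -> bool:
    """Return True if the model judges the premise to be supported by the paper."""
    prompt = (
        "Paper:\n" + paper_text + "\n\nPremise:\n" + premise +
        "\n\nIs this premise factually correct given the paper? Answer TRUE or FALSE."
    )
    return llm(prompt).strip().upper().startswith("TRUE")


def is_misinformed(weakness: str, paper_text: str, llm: LLM) -> bool:
    """Treat a weakness as misinformed if any reconstructed premise is false."""
    verdicts = [
        PremiseVerdict(p, judge_premise(p, paper_text, llm))
        for p in reconstruct_premises(weakness, llm)
    ]
    return any(not v.is_factual for v in verdicts)
```

Given such per-point labels from a model and from human annotators, the human-model agreement the abstract reports could be quantified with a standard statistic such as Cohen's kappa over the paired labels.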