

ReviewScore: Misinformed Peer Review Detection with Large Language Models

September 25, 2025
Authors: Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
cs.AI

Abstract

Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that rest on incorrect premises, or "questions" in a review that are already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
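The abstract describes a two-step procedure: reconstruct the explicit and implicit premises behind a stated weakness, then judge each premise against the paper. Below is a minimal sketch of that idea, not the authors' actual engine; the `llm` callable, prompts, and function names are illustrative assumptions standing in for any chat-model call.

```python
# Minimal sketch (assumptions, not the paper's implementation) of premise-level
# misinformed-weakness detection: reconstruct premises from a review weakness
# with an LLM, judge each premise against the paper text, and flag the weakness
# as misinformed if any premise is judged factually incorrect.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # stand-in for any chat-model call (hypothetical interface)


@dataclass
class PremiseVerdict:
    premise: str
    is_factual: bool


def reconstruct_premises(weakness: str, llm: LLM) -> List[str]:
    """Ask the model to list every explicit and implicit premise of the weakness."""
    prompt = (
        "List every explicit and implicit premise underlying this review weakness, "
        "one per line:\n" + weakness
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]


def judge_premise(premise: str, paper_text: str, llm: LLM) -> bool:
    """Return True if the model judges the premise to be supported by the paper."""
    prompt = (
        "Paper:\n" + paper_text + "\n\nPremise:\n" + premise +
        "\n\nIs this premise factually correct given the paper? Answer TRUE or FALSE."
    )
    return llm(prompt).strip().upper().startswith("TRUE")


def is_misinformed(weakness: str, paper_text: str, llm: LLM) -> bool:
    """Treat a weakness as misinformed if any reconstructed premise is false."""
    verdicts = [
        PremiseVerdict(p, judge_premise(p, paper_text, llm))
        for p in reconstruct_premises(weakness, llm)
    ]
    return any(not v.is_factual for v in verdicts)
```

Given such per-point labels from a model and from human annotators, the human-model agreement the abstract reports could be quantified with a standard statistic such as Cohen's kappa over the paired labels.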