ReviewScore: Misinformed Peer Review Detection with Large Language Models

September 25, 2025
Authors: Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
cs.AI

Abstract

Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that rest on incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and we introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of large language models (LLMs) to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and confirm moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
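
The premise-level check described in the abstract can be pictured as a small pipeline: reconstruct every explicit and implicit premise behind a weakness, judge each premise against the paper, and flag the weakness as misinformed if any premise fails. The sketch below is illustrative only; the `ask_llm` helper, the prompts, and the function names are assumptions for exposition, not the authors' engine or prompts.

```python
# Illustrative sketch only: the prompts and helpers below are hypothetical and
# are NOT the ReviewScore authors' implementation. `ask_llm` stands in for any
# chat-capable LLM call.
from dataclasses import dataclass
from typing import List


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., via a chat-completion API)."""
    raise NotImplementedError


@dataclass
class PremiseJudgement:
    premise: str
    is_factual: bool


def reconstruct_premises(weakness: str) -> List[str]:
    """Ask the LLM to list every explicit and implicit premise behind a weakness."""
    prompt = (
        "List, one per line, every explicit and implicit premise that the "
        f"following review weakness relies on.\n\nWeakness:\n{weakness}"
    )
    return [line.strip("- ").strip()
            for line in ask_llm(prompt).splitlines() if line.strip()]


def judge_premise(premise: str, paper_text: str) -> PremiseJudgement:
    """Check a single premise against the paper text."""
    prompt = (
        "Given the paper below, answer YES if the premise is supported by the "
        f"paper and NO otherwise.\n\nPaper:\n{paper_text}\n\nPremise:\n{premise}"
    )
    answer = ask_llm(prompt).strip().upper()
    return PremiseJudgement(premise, answer.startswith("YES"))


def weakness_is_misinformed(weakness: str, paper_text: str) -> bool:
    """A weakness counts as misinformed if any reconstructed premise is not factual."""
    return any(not judge_premise(p, paper_text).is_factual
               for p in reconstruct_premises(weakness))
```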