ReviewScore: Misinformed Peer Review Detection with Large Language Models

September 25, 2025
Authors: Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang
cs.AI

Abstract

Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that rest on incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and we introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of large language models (LLMs) to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and confirm moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
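
The premise-level check described in the abstract can be pictured as a small pipeline: reconstruct every explicit and implicit premise behind a weakness, judge each premise against the paper, and flag the weakness as misinformed if any premise fails. The sketch below is illustrative only; the `ask_llm` helper, the prompts, and the function names are assumptions for exposition, not the authors' engine or prompts.

```python
# Illustrative sketch only: the prompts and helpers below are hypothetical and
# are NOT the ReviewScore authors' implementation. `ask_llm` stands in for any
# chat-capable LLM call.
from dataclasses import dataclass
from typing import List


def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., via a chat-completion API)."""
    raise NotImplementedError


@dataclass
class PremiseJudgement:
    premise: str
    is_factual: bool


def reconstruct_premises(weakness: str) -> List[str]:
    """Ask the LLM to list every explicit and implicit premise behind a weakness."""
    prompt = (
        "List, one per line, every explicit and implicit premise that the "
        f"following review weakness relies on.\n\nWeakness:\n{weakness}"
    )
    return [line.strip("- ").strip()
            for line in ask_llm(prompt).splitlines() if line.strip()]


def judge_premise(premise: str, paper_text: str) -> PremiseJudgement:
    """Check a single premise against the paper text."""
    prompt = (
        "Given the paper below, answer YES if the premise is supported by the "
        f"paper and NO otherwise.\n\nPaper:\n{paper_text}\n\nPremise:\n{premise}"
    )
    answer = ask_llm(prompt).strip().upper()
    return PremiseJudgement(premise, answer.startswith("YES"))


def weakness_is_misinformed(weakness: str, paper_text: str) -> bool:
    """A weakness counts as misinformed if any reconstructed premise is not factual."""
    return any(not judge_premise(p, paper_text).is_factual
               for p in reconstruct_premises(weakness))
```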