ReviewScore: Misinformed Peer Review Detection with Large Language Models

Abstract

Peer review serves as a backbone of academic research, but at most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and we introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and find moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
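As a reading aid, the definition of a misinformed review point can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `Weakness` and `Question`, the fields `premise_is_factual` and `answered_by_paper`, and the helper `misinformed_rate` are all hypothetical, and the sketch treats the outcome as a binary flag, which the abstract does not confirm. The factuality and answerability labels are assumed to come from a human annotator or an LLM judge; the sketch does not compute them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Weakness:
    """A review weakness with its reconstructed premises (hypothetical structure)."""
    text: str
    # One flag per explicit or implicit premise: True if the premise is
    # factually correct with respect to the paper (label supplied externally).
    premise_is_factual: List[bool] = field(default_factory=list)


@dataclass
class Question:
    """A review question (hypothetical structure)."""
    text: str
    # True if the paper already answers the question (label supplied externally).
    answered_by_paper: bool = False


def is_misinformed_weakness(w: Weakness) -> bool:
    # A weakness is misinformed if at least one of its premises is incorrect.
    return any(not ok for ok in w.premise_is_factual)


def is_misinformed_question(q: Question) -> bool:
    # A question is misinformed if the paper already answers it.
    return q.answered_by_paper


def misinformed_rate(flags: List[bool]) -> float:
    # Fraction of review points flagged as misinformed (0.0 for an empty list).
    return sum(flags) / len(flags) if flags else 0.0
```

Aggregating the premise-level flags with `any` mirrors the abstract's wording that a weakness is misinformed when it contains incorrect premises; applying `misinformed_rate` to a set of labeled review points gives the kind of percentage the abstract reports.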