ReviewScore: Misinformed Peer Review Detection with Large Language Models

Abstract

Peer review is the backbone of academic research, but at most AI conferences review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and we introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of large language models (LLMs) to automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs and find moderate agreement. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.
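
The two definitions of a misinformed review point can be made concrete with a small sketch. It is illustrative only and is not the authors' engine: the class names, field names, and the binary "any incorrect premise" rule are assumptions drawn from the wording of the abstract, and the premise labels are assumed to come from some upstream annotator (human or LLM).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Weakness:
    """A weakness review point (hypothetical container, not from the paper)."""
    text: str
    # One factuality label per reconstructed premise (True = factual).
    premise_is_factual: List[bool] = field(default_factory=list)


@dataclass
class Question:
    """A question review point (hypothetical container, not from the paper)."""
    text: str
    # True if the paper already contains the answer to the question.
    answered_by_paper: bool = False


def weakness_is_misinformed(w: Weakness) -> bool:
    # Assumed rule: a weakness is misinformed if at least one of its
    # explicit or implicit premises is incorrect.
    return any(not ok for ok in w.premise_is_factual)


def question_is_misinformed(q: Question) -> bool:
    # Assumed rule: a question is misinformed if the paper already answers it.
    return q.answered_by_paper


def misinformed_rate(flags: List[bool]) -> float:
    # Fraction of review points flagged as misinformed (0.0 for an empty list).
    return sum(flags) / len(flags) if flags else 0.0
```

Under these assumptions, a single incorrect premise is enough to flag a weakness, which is one way to read why premise-level labels are easier to agree on than a single weakness-level judgment.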