Review Arcade: Human Alignment and Robustness to Gaming in LLM Reviews

Abstract

LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers use LLM assistance, but that authors also use LLMs to revise their papers before submission. In this work, we run empirical experiments on papers submitted to the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we find that LLM reviews align only to a limited extent with human reviews: in the best-case scenario the alignment is reasonable, but it varies substantially across prompts and models. Second, we investigate the scenario in which an author uses an iterative draft-revise workflow to improve a submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. Our code is available at: https://github.com/uhh-hcds/reviewarcade