Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract

LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers use LLM assistance, but also that authors use LLMs to revise their papers before submission. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we find that LLM reviews align only to a limited extent with human ones: in the best case the alignment is reasonable, but it varies substantially across prompts and models. Second, we investigate the scenario in which an author uses an iterative draft-revise workflow to improve a submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.