Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract

LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only do reviewers use LLM assistance, but authors also use LLMs to revise their papers before submission. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we find that LLM reviews align with human reviews only to a limited extent: in the best-case scenario the alignment is reasonable, but it varies substantially across prompts and models. Finally, we investigate the scenario in which an author uses an iterative draft-revise workflow to improve a submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.