Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract

LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers are using LLM assistance, but also that authors use LLMs to revise their papers before submission. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we find that LLM reviews align only to a limited degree with human ones; in the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.