Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract

LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers use LLM assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we find that LLM reviews align with human reviews only to a limited extent. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.