Limits and Opportunities of AI Reviewers: Examining Reviews of Nature-Family Papers with 45 Expert Scientists

Abstract

As AI capabilities advance, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question. Many scientists view them simply as probabilistic systems that lack the expertise to evaluate research, while other researchers are more optimistic about their readiness despite a lack of concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study in which 45 domain scientists in the Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) on correctness, significance, and sufficiency of evidence. The criticisms were extracted from human-written and AI-generated reviews of 82 Nature-family papers. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human reviewer on every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and AI reviewers surface a distinct 26% of issues that no human reviewer raises. However, AI reviewers overlap with one another far more than humans do (21% vs. 3% for cross-reviewer pairs), and they exhibit 16 recurring weaknesses that humans do not share, such as limited subfield knowledge, poor long-context management across multiple files, and an overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.