ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing

Abstract

Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing scientific papers or proposals? We first conduct pilot studies, in which we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks:

1. **Identifying errors**: We construct 13 short computer science papers, each with a deliberately inserted error, and ask the LLM to check the correctness of these papers. The LLM finds errors in 7 of them, spanning both mathematical and conceptual errors.
2. **Verifying checklists**: We task the LLM with verifying 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. Across 119 {checklist question, paper} pairs, the LLM achieves 86.6% accuracy.
3. **Choosing the "better" paper**: We generate 10 pairs of abstracts, deliberately designing each pair so that one abstract is clearly superior to the other. The LLM, however, struggles to discern these relatively straightforward distinctions, erring in its evaluations for 6 of the 10 pairs.

Based on these experiments, we think that LLMs have a promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.
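To put the three reported results on a common scale, the counts in the abstract can be converted to rates. The following is a minimal sketch, not part of the study: the helper names `rate` and `implied_correct` are hypothetical, and the count of correctly verified checklist pairs is an inference from the reported 86.6% and 119 pairs, not a figure stated in the abstract.

```python
def rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage, rounded to one decimal place."""
    return round(100 * successes / total, 1)


def implied_correct(accuracy_pct: float, total: int) -> int:
    """Nearest whole count consistent with a reported percentage.

    This is an inference from rounded figures, not a reported number.
    """
    return round(accuracy_pct / 100 * total)


# Counts taken directly from the abstract.
papers_with_error_found, papers_total = 7, 13
abstract_pairs_wrong, abstract_pairs_total = 6, 10
checklist_accuracy_pct, checklist_pairs = 86.6, 119

# Task 1: share of papers in which the inserted error was found.
error_detection_rate = rate(papers_with_error_found, papers_total)

# Task 3: share of abstract pairs judged correctly (10 - 6 = 4 pairs).
pair_choice_rate = rate(abstract_pairs_total - abstract_pairs_wrong,
                        abstract_pairs_total)

# Task 2: inferred number of correctly verified {question, paper} pairs.
checklist_correct = implied_correct(checklist_accuracy_pct, checklist_pairs)

print(f"Error identification: {error_detection_rate}%")   # 53.8%
print(f"Better-abstract choice: {pair_choice_rate}%")     # 40.0%
print(f"Checklist pairs correct (inferred): {checklist_correct} of {checklist_pairs}")
```

Under these assumptions, the error-identification rate is about 53.8%, and the abstract-comparison task is answered correctly for only 40.0% of pairs. The 86.6% checklist accuracy is consistent with 103 correct pairs out of 119.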
