ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
June 1, 2023
Authors: Ryan Liu, Nihar B. Shah
cs.AI
Abstract
Given the rapid ascent of large language models (LLMs), we study the
question: (How) can large language models help in the review of scientific
papers or proposals? We first conduct some pilot studies where we find that (i)
GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly,
OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to
identify errors) outperforms prompting to simply write a review. With these
insights, we study the use of LLMs (specifically, GPT-4) for three tasks:
1. Identifying errors: We construct 13 short computer science papers each
with a deliberately inserted error, and ask the LLM to check for the
correctness of these papers. We observe that the LLM finds errors in 7 of them,
spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist
questions in the respective sections of 15 NeurIPS 2022 papers. We find that
across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts,
deliberately designing each pair so that one abstract was clearly
superior to the other. The LLM, however, struggled to discern these
relatively straightforward distinctions accurately, committing errors in its
evaluations for 6 out of the 10 pairs.
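The pilot-study finding that a targeted question outperforms a generic "write a review" instruction can be sketched as a pair of prompt templates. The wordings and the helper name `build_prompt` below are illustrative assumptions, not the authors' actual prompts.

```python
# Illustrative contrast between the two prompting strategies from the
# pilot studies. These templates are hypothetical wordings, not the
# authors' actual prompts.

GENERIC_PROMPT = "Write a review of the following paper:\n\n{paper}"

SPECIFIC_PROMPT = (
    "Carefully check the following paper for correctness. If it contains "
    "an error (mathematical or conceptual), identify it and explain why it "
    "is wrong; otherwise state that the paper appears correct.\n\n{paper}"
)

def build_prompt(paper_text: str, specific: bool = True) -> str:
    """Return the prompt that would be sent to the LLM for one paper."""
    template = SPECIFIC_PROMPT if specific else GENERIC_PROMPT
    return template.format(paper=paper_text)
```

The specific-question template constrains the model to a single verifiable judgment, which is the framing the error-identification task above relies on.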
Based on these experiments, we think that LLMs have a promising use as
reviewing assistants for specific reviewing tasks, but not (yet) for complete
evaluations of papers or proposals.
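The checklist-verification task (task 2) amounts to scoring closed-ended yes/no answers over {checklist question, paper} pairs. A minimal sketch of that evaluation loop follows; `ask_llm`, the prompt wording, and the naive answer parser are assumptions standing in for the real GPT-4 call and the paper's actual setup.

```python
# Hypothetical sketch of checklist verification: for each
# (question, section, human_label) triple, ask the model a closed-ended
# question and score its yes/no answer against the human label.
# `ask_llm` stands in for a real model call (e.g. to GPT-4).

def parse_yes_no(answer: str) -> bool:
    """Map a free-text LLM answer onto a yes/no verdict (naive heuristic)."""
    return answer.strip().lower().startswith("yes")

def checklist_accuracy(pairs, ask_llm) -> float:
    """pairs: iterable of (question, section_text, human_label) triples."""
    correct = 0
    total = 0
    for question, section, label in pairs:
        prompt = (
            f"Answer yes or no: {question}\n\n"
            f"Relevant paper section:\n{section}"
        )
        if parse_yes_no(ask_llm(prompt)) == label:
            correct += 1
        total += 1
    return correct / total
```

With 119 such pairs, an accuracy of 86.6% corresponds to roughly 103 correct verdicts.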