ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
June 1, 2023
Authors: Ryan Liu, Nihar B. Shah
cs.AI
Abstract
Given the rapid ascent of large language models (LLMs), we study the
question: (How) can large language models help in the reviewing of scientific
papers or proposals? We first conduct some pilot studies where we find that (i)
GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly,
OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to
identify errors) outperforms prompting to simply write a review. With these
insights, we study the use of LLMs (specifically, GPT-4) for three tasks:
1. Identifying errors: We construct 13 short computer science papers each
with a deliberately inserted error, and ask the LLM to check for the
correctness of these papers. We observe that the LLM finds errors in 7 of them,
spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist
questions in the respective sections of 15 NeurIPS 2022 papers. We find that
across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts,
deliberately designing each pair in such a way that one abstract was clearly
superior to the other. The LLM, however, struggled to discern these
relatively straightforward distinctions accurately, committing errors in its
evaluations for 6 out of the 10 pairs.
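The per-task success rates implied by these counts can be checked with a few lines of arithmetic. A minimal sketch, where the 103 correct checklist verdicts are an inference from the reported 86.6% accuracy over 119 pairs (103/119 ≈ 0.866), and the other tallies are taken directly from the abstract:

```python
# Tallies for the three GPT-4 tasks described in the abstract.
# (103 correct checklist verdicts is inferred from the reported
# 86.6% accuracy over 119 pairs: 103/119 ≈ 0.866.)
tasks = {
    "identifying errors": (7, 13),         # errors found / papers with a planted error
    "verifying checklists": (103, 119),    # correct verdicts / {question, paper} pairs
    "choosing the better paper": (4, 10),  # pairs judged correctly / abstract pairs
}

for name, (correct, total) in tasks.items():
    print(f"{name}: {correct}/{total} = {correct / total:.1%}")
# → identifying errors: 7/13 = 53.8%
# → verifying checklists: 103/119 = 86.6%
# → choosing the better paper: 4/10 = 40.0%
```

The contrast in these rates is what motivates the paper's conclusion: the LLM is markedly stronger on narrow verification tasks than on holistic comparative judgment.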
Based on these experiments, we think that LLMs have a promising use as
reviewing assistants for specific reviewing tasks, but not (yet) for complete
evaluations of papers or proposals.
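The pilot-study finding that a specific question outperforms a generic review request can be sketched as two prompt templates. The exact prompt wording is not given in the abstract, so these templates are illustrative assumptions; only the generic-versus-targeted contrast comes from the source:

```python
# Two prompting styles compared in the pilot studies. The wording here
# is an illustrative assumption, not the paper's actual prompts; the
# abstract reports only that the targeted style worked better.
def generic_review_prompt(paper_text: str) -> str:
    """Simply ask for a review (the weaker style per the pilot studies)."""
    return f"Write a peer review of the following paper:\n\n{paper_text}"

def targeted_error_prompt(paper_text: str) -> str:
    """Ask a specific question, e.g. to identify errors (the stronger style)."""
    return (
        "Carefully check the following paper for mathematical and "
        f"conceptual errors, and describe any errors you find:\n\n{paper_text}"
    )

paper = "Theorem 1. Every bounded sequence converges. Proof: ..."
print(targeted_error_prompt(paper))
```

Either string would then be sent to the model of choice (GPT-4 in the paper's experiments); the point of the sketch is only the shape of the two prompts.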