ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
June 1, 2023
Authors: Ryan Liu, Nihar B. Shah
cs.AI
Abstract
Given the rapid ascent of large language models (LLMs), we study the
question: (How) can large language models help in the reviewing of scientific
papers or proposals? We first conduct some pilot studies where we find that (i)
GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly,
OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to
identify errors) outperforms prompting to simply write a review. With these
insights, we study the use of LLMs (specifically, GPT-4) for three tasks:
1. Identifying errors: We construct 13 short computer science papers each
with a deliberately inserted error, and ask the LLM to check for the
correctness of these papers. We observe that the LLM finds errors in 7 of them,
spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist
questions in the respective sections of 15 NeurIPS 2022 papers. We find that
across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts,
deliberately designing each pair in such a way that one abstract was clearly
superior to the other. The LLM, however, struggled to discern these
relatively straightforward distinctions accurately, committing errors in its
evaluations for 6 out of the 10 pairs.
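The per-task success rates implied by these counts can be checked with a few lines of arithmetic. A minimal sketch, where the 103 correct checklist verdicts are an inference from the reported 86.6% accuracy over 119 pairs (103/119 ≈ 0.866), and the other tallies are taken directly from the abstract:

```python
# Tallies for the three GPT-4 tasks described in the abstract.
# (103 correct checklist verdicts is inferred from the reported
# 86.6% accuracy over 119 pairs: 103/119 ≈ 0.866.)
tasks = {
    "identifying errors": (7, 13),         # errors found / papers with a planted error
    "verifying checklists": (103, 119),    # correct verdicts / {question, paper} pairs
    "choosing the better paper": (4, 10),  # pairs judged correctly / abstract pairs
}

for name, (correct, total) in tasks.items():
    print(f"{name}: {correct}/{total} = {correct / total:.1%}")
# → identifying errors: 7/13 = 53.8%
# → verifying checklists: 103/119 = 86.6%
# → choosing the better paper: 4/10 = 40.0%
```

The contrast in these rates is what motivates the paper's conclusion: the LLM is markedly stronger on narrow verification tasks than on holistic comparative judgment.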
Based on these experiments, we think that LLMs have a promising use as
reviewing assistants for specific reviewing tasks, but not (yet) for complete
evaluations of papers or proposals.
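The pilot-study finding that a specific question outperforms a generic review request can be sketched as two prompt templates. The exact prompt wording is not given in the abstract, so these templates are illustrative assumptions; only the generic-versus-targeted contrast comes from the source:

```python
# Two prompting styles compared in the pilot studies. The wording here
# is an illustrative assumption, not the paper's actual prompts; the
# abstract reports only that the targeted style worked better.
def generic_review_prompt(paper_text: str) -> str:
    """Simply ask for a review (the weaker style per the pilot studies)."""
    return f"Write a peer review of the following paper:\n\n{paper_text}"

def targeted_error_prompt(paper_text: str) -> str:
    """Ask a specific question, e.g. to identify errors (the stronger style)."""
    return (
        "Carefully check the following paper for mathematical and "
        f"conceptual errors, and describe any errors you find:\n\n{paper_text}"
    )

paper = "Theorem 1. Every bounded sequence converges. Proof: ..."
print(targeted_error_prompt(paper))
```

Either string would then be sent to the model of choice (GPT-4 in the paper's experiments); the point of the sketch is only the shape of the two prompts.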