ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing
June 1, 2023
Authors: Ryan Liu, Nihar B. Shah
cs.AI
Abstract
Given the rapid ascent of large language models (LLMs), we study the
question: (How) can large language models help in the review of scientific
papers or proposals? We first conduct some pilot studies where we find that (i)
GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly,
OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to
identify errors) outperforms prompting to simply write a review. With these
insights, we study the use of LLMs (specifically, GPT-4) for three tasks:
1. Identifying errors: We construct 13 short computer science papers each
with a deliberately inserted error, and ask the LLM to check for the
correctness of these papers. We observe that the LLM finds errors in 7 of them,
spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist
questions in the respective sections of 15 NeurIPS 2022 papers. We find that
across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts,
deliberately designing each pair so that one abstract was clearly
superior to the other. The LLM, however, struggled to discern these
relatively straightforward distinctions accurately, committing errors in its
evaluations for 6 out of the 10 pairs.
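The pilot-study finding that a targeted question outperforms a generic "write a review" instruction can be sketched as a pair of prompt templates. The wordings and the helper name `build_prompt` below are illustrative assumptions, not the authors' actual prompts.

```python
# Illustrative contrast between the two prompting strategies from the
# pilot studies. These templates are hypothetical wordings, not the
# authors' actual prompts.

GENERIC_PROMPT = "Write a review of the following paper:\n\n{paper}"

SPECIFIC_PROMPT = (
    "Carefully check the following paper for correctness. If it contains "
    "an error (mathematical or conceptual), identify it and explain why it "
    "is wrong; otherwise state that the paper appears correct.\n\n{paper}"
)

def build_prompt(paper_text: str, specific: bool = True) -> str:
    """Return the prompt that would be sent to the LLM for one paper."""
    template = SPECIFIC_PROMPT if specific else GENERIC_PROMPT
    return template.format(paper=paper_text)
```

The specific-question template constrains the model to a single verifiable judgment, which is the framing the error-identification task above relies on.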
Based on these experiments, we think that LLMs have a promising use as
reviewing assistants for specific reviewing tasks, but not (yet) for complete
evaluations of papers or proposals.
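The checklist-verification task (task 2) amounts to scoring closed-ended yes/no answers over {checklist question, paper} pairs. A minimal sketch of that evaluation loop follows; `ask_llm`, the prompt wording, and the naive answer parser are assumptions standing in for the real GPT-4 call and the paper's actual setup.

```python
# Hypothetical sketch of checklist verification: for each
# (question, section, human_label) triple, ask the model a closed-ended
# question and score its yes/no answer against the human label.
# `ask_llm` stands in for a real model call (e.g. to GPT-4).

def parse_yes_no(answer: str) -> bool:
    """Map a free-text LLM answer onto a yes/no verdict (naive heuristic)."""
    return answer.strip().lower().startswith("yes")

def checklist_accuracy(pairs, ask_llm) -> float:
    """pairs: iterable of (question, section_text, human_label) triples."""
    correct = 0
    total = 0
    for question, section, label in pairs:
        prompt = (
            f"Answer yes or no: {question}\n\n"
            f"Relevant paper section:\n{section}"
        )
        if parse_yes_no(ask_llm(prompt)) == label:
            correct += 1
        total += 1
    return correct / total
```

With 119 such pairs, an accuracy of 86.6% corresponds to roughly 103 correct verdicts.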