

ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing

June 1, 2023
Authors: Ryan Liu, Nihar B. Shah
cs.AI

Abstract

Given the rapid ascent of large language models (LLMs), we study the question: (how) can large language models help in reviewing of scientific papers or proposals? We first conduct some pilot studies where we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks:

1. Identifying errors: We construct 13 short computer science papers, each with a deliberately inserted error, and ask the LLM to check the correctness of these papers. We observe that the LLM finds errors in 7 of them, spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM with verifying 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. We find that across 119 {checklist question, paper} pairs, the LLM achieved 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts, deliberately designing each pair so that one abstract is clearly superior to the other. The LLM, however, struggled to discern these relatively straightforward distinctions, committing errors in its evaluations for 6 of the 10 pairs.

Based on these experiments, we conclude that LLMs hold promise as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.
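The pilot-study finding above — that a targeted prompt (e.g., "identify errors") works better than asking the model to simply write a review — can be sketched as below. This is an illustrative reconstruction, not the authors' code; the system message, prompt wording, and the commented-out `gpt-4` client call are all assumptions.

```python
# Illustrative sketch: building a targeted error-identification prompt
# versus a generic "write a review" prompt, following the paper's
# observation that specific questions elicit better reviewing help.

def build_review_prompt(paper_text: str, targeted: bool = True) -> list[dict]:
    """Return chat messages for an LLM reviewer.

    targeted=True asks a specific correctness question (the stronger
    strategy in the paper's pilot studies); targeted=False asks for a
    free-form review (the weaker baseline).
    """
    if targeted:
        question = (
            "Check this paper carefully for correctness. Identify any "
            "mathematical or conceptual errors, quoting the exact passage "
            "and explaining why it is wrong."
        )
    else:
        question = "Please write a review of this paper."
    return [
        {"role": "system", "content": "You are a careful scientific reviewer."},
        {"role": "user", "content": f"{question}\n\n---\n{paper_text}"},
    ]

# Sending the messages would use an LLM API, e.g. (hypothetical client):
# reply = client.chat.completions.create(
#     model="gpt-4", messages=build_review_prompt(paper_text)
# )
```

Keeping the prompt construction separate from the API call makes it easy to run the same papers through both prompting strategies, as the pilot studies compare.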