ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing

Abstract

Given the rapid rise of large language models (LLMs), we study the question: (How) can large language models help in reviewing scientific papers or proposals? We first conduct pilot studies, in which we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMA, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks:

1. Identifying errors: We construct 13 short computer science papers, each with a deliberately inserted error, and ask the LLM to check the correctness of these papers. The LLM finds errors in 7 of them, spanning both mathematical and conceptual errors.
2. Verifying checklists: We task the LLM with verifying 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. Across 119 {checklist question, paper} pairs, the LLM achieved 86.6% accuracy.
3. Choosing the "better" paper: We generate 10 pairs of abstracts, deliberately designing each pair so that one abstract is clearly superior to the other. The LLM, however, struggled to discern these relatively straightforward distinctions, making errors in its evaluations for 6 of the 10 pairs.

Based on these experiments, we think that LLMs have promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.
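The counts reported in the abstract can be turned into comparable rates with a few lines of Python. This is a minimal sketch and not part of the study. The function names `rate` and `consistent_counts` are hypothetical. The abstract does not state how many checklist verdicts were correct, so the code only searches for integer counts compatible with the reported 86.6% over 119 pairs.

```python
def rate(successes, total):
    """Success rate as a percentage, rounded to one decimal place."""
    return round(100 * successes / total, 1)


def consistent_counts(total, reported_pct, decimals=1):
    """Every integer count k whose rate k/total rounds to reported_pct.

    Assumption: the reported percentage was rounded to `decimals` places.
    """
    return [
        k
        for k in range(total + 1)
        if round(100 * k / total, decimals) == reported_pct
    ]


# Task 1: errors found in 7 of 13 constructed papers.
error_detection_pct = rate(7, 13)

# Task 3: 6 of 10 pairs judged wrongly, so 4 of 10 judged correctly.
pair_choice_pct = rate(10 - 6, 10)

# Task 2: counts of correct verdicts compatible with 86.6% of 119 pairs.
checklist_candidates = consistent_counts(119, 86.6)

print(error_detection_pct, pair_choice_pct, checklist_candidates)
```

Under these assumptions, the error-detection rate is 53.8%, the pair-choice accuracy is 40.0%, and 103 is the only count of correct checklist verdicts consistent with the reported accuracy. That last figure is inferred from the rounding, not quoted from the source.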
