ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing

Abstract

Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing scientific papers or proposals? We first conduct pilot studies in which we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks:

1. **Identifying errors**: We construct 13 short computer science papers, each with a deliberately inserted error, and ask the LLM to check the correctness of these papers. We observe that the LLM finds errors in 7 of them, spanning both mathematical and conceptual errors.
2. **Verifying checklists**: We task the LLM with verifying 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. We find that across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy.
3. **Choosing the "better" paper**: We generate 10 pairs of abstracts, deliberately designing each pair so that one abstract is clearly superior to the other. The LLM, however, struggled to discern these relatively straightforward distinctions accurately, committing errors in its evaluations for 6 out of the 10 pairs.

Based on these experiments, we think that LLMs have a promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.
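To put the three headline numbers on a common scale, the following minimal sketch converts the counts quoted in the abstract into percentages. The variable names and the `percent` helper are illustrative assumptions, not part of the study. The number of correct checklist answers is not stated in the abstract, so the script only infers which integer counts out of 119 are consistent with the reported 86.6%.

```python
# Figures quoted in the abstract; all names here are illustrative.
papers_with_errors = 13    # constructed papers, one inserted error each
errors_found = 7           # papers in which the LLM found the error
checklist_pairs = 119      # {checklist question, paper} pairs
reported_accuracy = 86.6   # percent, as reported in the abstract
abstract_pairs = 10        # pairs of abstracts
misjudged_pairs = 6        # pairs the LLM evaluated incorrectly


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


error_detection_rate = percent(errors_found, papers_with_errors)
pairwise_accuracy = percent(abstract_pairs - misjudged_pairs, abstract_pairs)

# Inferred, not reported: integer counts of correct answers whose
# rounded accuracy matches the reported 86.6%.
consistent_counts = [
    k for k in range(checklist_pairs + 1)
    if percent(k, checklist_pairs) == reported_accuracy
]

print(f"Error detection rate: {error_detection_rate}%")
print(f"Pairwise accuracy: {pairwise_accuracy}%")
print(f"Correct-answer counts consistent with 86.6%: {consistent_counts}")
```

Under these figures, the error detection rate comes to 53.8% and the pairwise accuracy to 40.0%, and 103 is the only count of correct checklist answers consistent with the reported accuracy. The last value is an inference from rounding, not a number given in the source.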
