FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Abstract

Most large language models (LLMs) are trained once and never updated, so they cannot adapt dynamically to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on the limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved pieces of evidence and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
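The prompt-construction idea behind FreshPrompt can be illustrated with a minimal sketch. This is not the authors' implementation: the `Evidence` type, the `build_prompt` function, the field labels, the default of five evidence items, and the oldest-to-newest ordering are all assumptions made for illustration, and the few-shot demonstrations that the method relies on are omitted. The sketch only shows the three levers the abstract names: how many retrieved items are kept, the order in which they appear, and an instruction to answer concisely.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    """One retrieved search result (hypothetical structure, not from the paper)."""
    source: str
    published: date
    snippet: str


def build_prompt(question: str, evidences: list[Evidence], max_evidences: int = 5) -> str:
    """Assemble a search-augmented prompt from retrieved evidence.

    Assumption: keep the most recent `max_evidences` items and list them
    oldest-first, so the newest evidence sits closest to the question.
    """
    # Select the most recent items.
    newest = sorted(evidences, key=lambda e: e.published, reverse=True)[:max_evidences]
    # Re-order the selected items chronologically (oldest first).
    ordered = sorted(newest, key=lambda e: e.published)

    lines: list[str] = []
    for e in ordered:
        lines.append(f"source: {e.source}")
        lines.append(f"date: {e.published.isoformat()}")
        lines.append(f"snippet: {e.snippet}")
        lines.append("")  # blank line between evidence items

    lines.append(f"question: {question}")
    # Ask for a short, direct answer and for false premises to be flagged.
    lines.append(
        "Answer concisely and directly. "
        "If the question rests on a false premise, say so."
    )
    lines.append("answer:")
    return "\n".join(lines)
```

The returned string would be sent to whichever LLM is under evaluation. Varying `max_evidences` or the sort order in this sketch is one way to probe the sensitivity to evidence count and ordering that the abstract reports.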
