FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
October 5, 2023
Authors: Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong
cs.AI
Abstract
Most large language models (LLMs) are trained once and never updated; thus,
they lack the ability to dynamically adapt to our ever-changing world. In this
work, we perform a detailed study of the factuality of LLM-generated text in
the context of answering questions that test current world knowledge.
Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a
diverse range of question and answer types, including questions that require
fast-changing world knowledge as well as questions with false premises that
need to be debunked. We benchmark a diverse array of both closed and
open-source LLMs under a two-mode evaluation procedure that allows us to
measure both correctness and hallucination. Through human evaluations involving
more than 50K judgments, we shed light on limitations of these models and
demonstrate significant room for improvement: for instance, all models
(regardless of model size) struggle on questions that involve fast-changing
knowledge and false premises. Motivated by these results, we present
FreshPrompt, a simple few-shot prompting method that substantially boosts the
performance of an LLM on FreshQA by incorporating relevant and up-to-date
information retrieved from a search engine into the prompt. Our experiments
show that FreshPrompt outperforms both competing search engine-augmented
prompting methods such as Self-Ask (Press et al., 2022) and commercial
systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that
both the number of retrieved pieces of evidence and their order play a key role in
influencing the correctness of LLM-generated answers. Additionally, instructing
the LLM to generate concise and direct answers helps reduce hallucination
compared to encouraging more verbose answers. To facilitate future work, we
release FreshQA at github.com/freshllms/freshqa and commit to updating it at
regular intervals.
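
The abstract describes FreshPrompt only at a high level. As a rough, non-authoritative illustration of the idea (retrieved search evidence placed into the prompt, ordered so that the freshest results sit closest to the question, followed by an instruction to answer concisely), the Python sketch below assembles such a prompt. The Evidence fields, the date-based ordering heuristic, the build_freshprompt helper, and the demo data are assumptions made for illustration, not the paper's exact format or few-shot demonstrations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """One retrieved search result (fields are illustrative, not the paper's exact schema)."""
    source: str
    date: str      # publication or crawl date in ISO format, e.g. "2023-10-05"
    title: str
    snippet: str


def build_freshprompt(question: str, evidences: List[Evidence], max_evidences: int = 5) -> str:
    """Assemble a FreshPrompt-style prompt: retrieved evidence first, question last.

    Reflecting the paper's finding that evidence order matters, the most recent
    results are placed closest to the question (i.e., last in the prompt).
    """
    # Sort oldest -> newest so the freshest evidence sits right before the question.
    ordered = sorted(evidences, key=lambda e: e.date)[-max_evidences:]

    blocks = []
    for e in ordered:
        blocks.append(
            f"source: {e.source}\n"
            f"date: {e.date}\n"
            f"title: {e.title}\n"
            f"snippet: {e.snippet}\n"
        )

    evidence_section = "\n".join(blocks)
    # A concise-answer instruction, since the paper reports that concise,
    # direct answers reduce hallucination relative to verbose ones.
    instruction = (
        "Please answer the question based on the search results above. "
        "Answer as concisely as possible."
    )
    return f"{evidence_section}\nquestion: {question}\n{instruction}\nanswer:"


if __name__ == "__main__":
    # Hypothetical demo data; replace with real search-engine results in practice.
    demo_evidence = [
        Evidence("example.com/news", "2023-09-30", "Example headline",
                 "An illustrative snippet containing up-to-date facts."),
        Evidence("example.org/wiki", "2022-01-15", "Background article",
                 "Older contextual information about the topic."),
    ]
    prompt = build_freshprompt("Who currently holds the record discussed above?", demo_evidence)
    print(prompt)  # this string can be sent to any LLM completion API
```

The resulting string can be fed to any LLM completion API; the gains reported in the paper come from its specific evidence formatting and few-shot demonstrations, which this sketch does not reproduce.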