FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
October 5, 2023
Authors: Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong
cs.AI
Abstract
Most large language models (LLMs) are trained once and never updated; thus,
they lack the ability to dynamically adapt to our ever-changing world. In this
work, we perform a detailed study of the factuality of LLM-generated text in
the context of answering questions that test current world knowledge.
Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a
diverse range of question and answer types, including questions that require
fast-changing world knowledge as well as questions with false premises that
need to be debunked. We benchmark a diverse array of both closed and
open-source LLMs under a two-mode evaluation procedure that allows us to
measure both correctness and hallucination. Through human evaluations involving
more than 50K judgments, we shed light on limitations of these models and
demonstrate significant room for improvement: for instance, all models
(regardless of model size) struggle on questions that involve fast-changing
knowledge and false premises. Motivated by these results, we present
FreshPrompt, a simple few-shot prompting method that substantially boosts the
performance of an LLM on FreshQA by incorporating relevant and up-to-date
information retrieved from a search engine into the prompt. Our experiments
show that FreshPrompt outperforms both competing search engine-augmented
prompting methods such as Self-Ask (Press et al., 2022) and commercial
systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that
both the number of retrieved pieces of evidence and their order play a key role in
influencing the correctness of LLM-generated answers. Additionally, instructing
the LLM to generate concise and direct answers helps reduce hallucination
compared to encouraging more verbose answers. To facilitate future work, we
release FreshQA at github.com/freshllms/freshqa and commit to updating it at
regular intervals.
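
The abstract describes FreshPrompt only at a high level. As a rough, non-authoritative illustration of the idea (retrieved search evidence placed into the prompt, ordered so that the freshest results sit closest to the question, followed by an instruction to answer concisely), the Python sketch below assembles such a prompt. The Evidence fields, the date-based ordering heuristic, the build_freshprompt helper, and the demo data are assumptions made for illustration, not the paper's exact format or few-shot demonstrations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """One retrieved search result (fields are illustrative, not the paper's exact schema)."""
    source: str
    date: str      # publication or crawl date in ISO format, e.g. "2023-10-05"
    title: str
    snippet: str


def build_freshprompt(question: str, evidences: List[Evidence], max_evidences: int = 5) -> str:
    """Assemble a FreshPrompt-style prompt: retrieved evidence first, question last.

    Reflecting the paper's finding that evidence order matters, the most recent
    results are placed closest to the question (i.e., last in the prompt).
    """
    # Sort oldest -> newest so the freshest evidence sits right before the question.
    ordered = sorted(evidences, key=lambda e: e.date)[-max_evidences:]

    blocks = []
    for e in ordered:
        blocks.append(
            f"source: {e.source}\n"
            f"date: {e.date}\n"
            f"title: {e.title}\n"
            f"snippet: {e.snippet}\n"
        )

    evidence_section = "\n".join(blocks)
    # A concise-answer instruction, since the paper reports that concise,
    # direct answers reduce hallucination relative to verbose ones.
    instruction = (
        "Please answer the question based on the search results above. "
        "Answer as concisely as possible."
    )
    return f"{evidence_section}\nquestion: {question}\n{instruction}\nanswer:"


if __name__ == "__main__":
    # Hypothetical demo data; replace with real search-engine results in practice.
    demo_evidence = [
        Evidence("example.com/news", "2023-09-30", "Example headline",
                 "An illustrative snippet containing up-to-date facts."),
        Evidence("example.org/wiki", "2022-01-15", "Background article",
                 "Older contextual information about the topic."),
    ]
    prompt = build_freshprompt("Who currently holds the record discussed above?", demo_evidence)
    print(prompt)  # this string can be sent to any LLM completion API
```

The resulting string can be fed to any LLM completion API; the gains reported in the paper come from its specific evidence formatting and few-shot demonstrations, which this sketch does not reproduce.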