

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

October 5, 2023
作者: Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong
cs.AI

Abstract

Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved pieces of evidence and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
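The abstract notes that FreshPrompt works by placing search-retrieved evidence into the prompt, that the ordering of that evidence matters, and that the model is instructed to answer concisely. The sketch below illustrates that prompt-assembly idea only; the `Evidence` type, field names, and exact prompt wording are illustrative assumptions, not the paper's actual format, and retrieval itself (the search engine call) is out of scope here.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """One retrieved search result (hypothetical schema for illustration)."""
    source: str   # e.g. a URL or site name
    date: str     # ISO publication date, e.g. "2023-09-30"
    snippet: str  # text excerpt returned by the search engine

def build_fresh_prompt(question: str, evidences: list[Evidence]) -> str:
    """Assemble a FreshPrompt-style prompt: retrieved evidence first,
    then the question, with an instruction to answer concisely."""
    # Sort oldest-first so the most recent evidence sits closest to the
    # question; the abstract reports that evidence order affects correctness
    # (this particular ordering is an assumption, not the paper's recipe).
    ordered = sorted(evidences, key=lambda e: e.date)
    parts = [
        f"source: {e.source}\ndate: {e.date}\nsnippet: {e.snippet}\n"
        for e in ordered
    ]
    parts.append(f"question: {question}")
    # Concise-answer instruction, which the paper finds reduces hallucination.
    parts.append("answer: (answer as concisely as possible)")
    return "\n".join(parts)
```

In practice the resulting string would be sent to the LLM as a few-shot prompt alongside demonstration examples; here only the evidence-plus-question scaffolding is shown.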