FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Abstract

Most large language models (LLMs) are trained once and never updated, so they cannot dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on the limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
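The abstract describes FreshPrompt only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a hypothetical `Evidence` record and a hypothetical `build_prompt` helper, and it assumes that evidence is capped at a fixed count and ordered oldest to newest so the freshest item sits closest to the question. The field names, the cap of five, and the instruction wording are all illustrative assumptions. Few-shot demonstrations and the actual search engine call are omitted.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    """Hypothetical record for one retrieved search result."""
    source: str
    published: date
    snippet: str


def build_prompt(question, evidences, max_evidences=5, today=None):
    """Assemble a search-augmented prompt (illustrative only)."""
    today = today or date.today()
    # Assumption: keep only the most recent items (the "number" knob).
    newest = sorted(evidences, key=lambda e: e.published, reverse=True)
    kept = newest[:max_evidences]
    # Assumption: order oldest to newest (the "order" knob), so the
    # freshest evidence ends up right before the question.
    ordered = sorted(kept, key=lambda e: e.published)

    lines = []
    for e in ordered:
        lines.append(f"source: {e.source}")
        lines.append(f"date: {e.published.isoformat()}")
        lines.append(f"snippet: {e.snippet}")
        lines.append("")
    lines.append(f"question: {question}")
    # The abstract reports that asking for concise, direct answers
    # helps reduce hallucination; this wording is an assumption.
    lines.append(f"As of {today.isoformat()}, answer concisely and directly.")
    lines.append("answer:")
    return "\n".join(lines)
```

Varying `max_evidences` and the sort direction in a sketch like this is one way to probe the two factors the abstract highlights, namely how many evidences are included and in what order.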
