Fine-tuning Language Models for Factuality
November 14, 2023
Authors: Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn
cs.AI
Abstract
The fluency and creativity of large pre-trained language models (LLMs) have
led to their widespread use, sometimes even as a replacement for traditional
search engines. Yet language models are prone to making convincing but
factually inaccurate claims, often referred to as 'hallucinations.' These
errors can inadvertently spread misinformation or harmfully perpetuate
misconceptions. Further, manual fact-checking of model responses is a
time-consuming process, making human factuality labels expensive to acquire. In
this work, we fine-tune language models to be more factual, without human
labeling and targeting more open-ended generation settings than past work. We
leverage two key recent innovations in NLP to do so. First, several recent
works have proposed methods for judging the factuality of open-ended text by
measuring consistency with an external knowledge base or simply a large model's
confidence scores. Second, the direct preference optimization algorithm enables
straightforward fine-tuning of language models on objectives other than
supervised imitation, using a preference ranking over possible model responses.
We show that learning from automatically generated factuality preference
rankings, generated either through existing retrieval systems or our novel
retrieval-free approach, significantly improves the factuality (percent of
generated claims that are correct) of Llama-2 on held-out topics compared with
RLHF or decoding strategies targeted at factuality. At 7B scale, compared to
Llama-2-chat, we observe 58% and 40% reduction in factual error rate when
generating biographies and answering medical questions, respectively.
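
To make the described pipeline concrete, below is a minimal sketch (not the authors' implementation) of the two ingredients the abstract names: ranking sampled responses with an automatic factuality estimator to form preference pairs, and scoring a pair with the standard DPO loss. The helpers `sample_fn` and `factuality_score` are hypothetical placeholders for, respectively, sampling from the base model and either a retrieval-based or confidence-based factuality judge.

```python
# Minimal sketch, assuming hypothetical sample_fn and factuality_score helpers.
import math
from typing import Callable, List, Tuple


def build_preference_pairs(
    prompts: List[str],
    sample_fn: Callable[[str, int], List[str]],       # draws n responses from the base model (assumed)
    factuality_score: Callable[[str, str], float],    # higher = judged more factual (assumed)
    n_samples: int = 8,
) -> List[Tuple[str, str, str]]:
    """For each prompt, rank sampled responses by estimated factuality and
    keep the (most factual, least factual) pair as (chosen, rejected)."""
    pairs = []
    for x in prompts:
        responses = sample_fn(x, n_samples)
        ranked = sorted(responses, key=lambda y: factuality_score(x, y))
        pairs.append((x, ranked[-1], ranked[0]))  # (prompt, chosen, rejected)
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,        # log-prob of each response under the policy being trained
    ref_logp_chosen: float,
    ref_logp_rejected: float,    # log-prob of each response under the frozen reference model
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In this sketch the factuality judge plays the role that human preference labels play in RLHF: it orders model samples automatically, and DPO then fine-tunes the policy directly on those orderings without training a separate reward model.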