Fine-tuning Language Models for Factuality
November 14, 2023
Authors: Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn
cs.AI
Abstract
The fluency and creativity of large pre-trained language models (LLMs) have
led to their widespread use, sometimes even as a replacement for traditional
search engines. Yet language models are prone to making convincing but
factually inaccurate claims, often referred to as 'hallucinations.' These
errors can inadvertently spread misinformation or harmfully perpetuate
misconceptions. Further, manual fact-checking of model responses is a
time-consuming process, making human factuality labels expensive to acquire. In
this work, we fine-tune language models to be more factual, without human
labeling and targeting more open-ended generation settings than past work. We
leverage two key recent innovations in NLP to do so. First, several recent
works have proposed methods for judging the factuality of open-ended text by
measuring consistency with an external knowledge base or simply a large model's
confidence scores. Second, the direct preference optimization algorithm enables
straightforward fine-tuning of language models on objectives other than
supervised imitation, using a preference ranking over possible model responses.
We show that learning from automatically generated factuality preference
rankings, generated either through existing retrieval systems or our novel
retrieval-free approach, significantly improves the factuality (percent of
generated claims that are correct) of Llama-2 on held-out topics compared with
RLHF or decoding strategies targeted at factuality. At 7B scale, compared to
Llama-2-chat, we observe 58% and 40% reduction in factual error rate when
generating biographies and answering medical questions, respectively.
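
The abstract describes two ingredients: automatically ranking candidate responses by an estimated factuality score (via retrieval against a knowledge base or the model's own confidence), and fine-tuning on the resulting preference pairs with direct preference optimization (DPO). The sketch below is a minimal illustration of those two pieces, not the authors' code; the function names (factuality_score, make_preference_pair, dpo_loss) and the toy scoring rule are assumptions introduced here for clarity. Only the DPO loss formula itself follows the standard published objective.

import torch
import torch.nn.functional as F


def factuality_score(claims_correct: int, claims_total: int) -> float:
    """Fraction of extracted atomic claims judged correct, e.g. by
    checking them against a retrieval system or a model confidence score."""
    return claims_correct / max(claims_total, 1)


def make_preference_pair(resp_a: str, resp_b: str,
                         score_a: float, score_b: float):
    """Return (chosen, rejected), ordered by estimated factuality."""
    return (resp_a, resp_b) if score_a >= score_b else (resp_b, resp_a)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective: -log sigmoid(beta * (margin_policy - margin_ref)),
    where each margin is the log-probability gap of chosen over rejected."""
    margin = (policy_chosen_logp - policy_rejected_logp) \
             - (ref_chosen_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()


if __name__ == "__main__":
    # Toy example: response A has 9/10 correct claims, B has 5/10,
    # so A becomes the preferred ("chosen") response.
    chosen, rejected = make_preference_pair(
        "response A", "response B",
        factuality_score(9, 10), factuality_score(5, 10))
    print(chosen, "preferred over", rejected)

    # Dummy sequence log-probabilities for one preference pair.
    loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                    torch.tensor([-13.0]), torch.tensor([-14.5]))
    print("DPO loss:", float(loss))

In this setup no human labels are needed: the preference ordering comes entirely from the automatic factuality estimate, and DPO then shifts probability mass toward the more factual response relative to the reference model.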