Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
March 4, 2025
Authors: Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
cs.AI
Abstract
Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or
nonsensical information) when serving as AI assistants in various domains.
Since hallucinations are always interleaved with truthful content in LLM responses,
previous factuality alignment methods that conduct response-level preference
learning inevitably introduce noise during training. Therefore, this paper
proposes a fine-grained factuality alignment method based on Direct Preference
Optimization (DPO), called Mask-DPO. By incorporating sentence-level factuality
as mask signals, Mask-DPO learns only from factually correct sentences in the
preferred samples and avoids penalizing factual content in the non-preferred
samples, which resolves the ambiguity in preference learning.
Extensive experimental results demonstrate that Mask-DPO can significantly
improve the factuality of LLM responses to questions from both in-domain and
out-of-domain datasets, even though these questions and their corresponding topics
are unseen during training. Trained only on the ANAH train set, the score of
Llama3.1-8B-Instruct on the ANAH test set is improved from 49.19% to 77.53%,
even surpassing the score of Llama3.1-70B-Instruct (53.44%), while its
FactScore on the out-of-domain Biography dataset is also improved from 30.29%
to 39.39%. We further study the generalization property of Mask-DPO using
different training-sample scaling strategies and find that scaling the number
of topics in the dataset is more effective than scaling the number of questions. We
provide a hypothesis about what factuality alignment does to LLMs, based on the
implication of this phenomenon, and conduct proof-of-concept experiments to
verify it. We hope the method and the findings pave the way for future research
on scaling factuality alignment.
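
To make the masking idea concrete, below is a minimal sketch of a sentence-masked DPO loss. This is not the paper's released implementation: the function name `mask_dpo_loss`, the tensor shapes, and the way per-token log-probabilities and sentence-level factuality masks are obtained are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def mask_dpo_loss(pi_logps_w, ref_logps_w, mask_w,
                  pi_logps_l, ref_logps_l, mask_l,
                  beta: float = 0.1) -> torch.Tensor:
    """Sentence-masked DPO loss (illustrative sketch).

    pi_logps_* / ref_logps_*: (batch, seq_len) per-token log-probs of the
        preferred (w) and non-preferred (l) responses under the policy and
        the frozen reference model.
    mask_*: (batch, seq_len) 0/1 masks built from sentence-level factuality
        labels: tokens of factually correct sentences in the preferred
        response and tokens of hallucinated sentences in the non-preferred
        response are 1; everything else is 0 and does not contribute.
    """
    # Masked log-ratios: only unmasked tokens enter the implicit reward.
    logratio_w = ((pi_logps_w - ref_logps_w) * mask_w).sum(-1)
    logratio_l = ((pi_logps_l - ref_logps_l) * mask_l).sum(-1)
    # Standard DPO objective applied to the masked log-ratios.
    return -F.logsigmoid(beta * (logratio_w - logratio_l)).mean()


if __name__ == "__main__":
    # Toy shapes and random values; real masks come from a factuality annotator.
    B, T = 2, 16
    pi_w, ref_w = torch.randn(B, T), torch.randn(B, T)
    pi_l, ref_l = torch.randn(B, T), torch.randn(B, T)
    mask_w = torch.randint(0, 2, (B, T)).float()  # factual sentences in y_w
    mask_l = torch.randint(0, 2, (B, T)).float()  # hallucinated sentences in y_l
    print(mask_dpo_loss(pi_w, ref_w, mask_w, pi_l, ref_l, mask_l))
```

Setting both masks to all ones recovers the standard response-level DPO loss, which is one way to see how the sentence-level masks remove the noisy supervision described in the abstract.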