Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

Abstract

Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations always appear alongside truthful content in LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduce noise during training. Therefore, this paper proposes Mask-DPO, a fine-grained factuality alignment method based on Direct Preference Optimization (DPO). By incorporating sentence-level factuality as mask signals, Mask-DPO learns only from factually correct sentences in the preferred samples and avoids penalizing factual content in the non-preferred samples, which resolves the ambiguity in preference learning. Extensive experimental results demonstrate that Mask-DPO significantly improves the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, even though these questions and their corresponding topics are unseen during training. Trained only on the ANAH training set, Llama3.1-8B-Instruct improves its score on the ANAH test set from 49.19% to 77.53%, surpassing even Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset improves from 30.29% to 39.39%. We further study the generalization property of Mask-DPO using different training sample scaling strategies and find that scaling the number of topics in the dataset is more effective than scaling the number of questions. We propose a hypothesis about what factuality alignment does to LLMs, discuss the implications of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and the findings pave the way for future research on scaling factuality alignment.
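The masking idea described in the abstract can be illustrated with a minimal sketch. It assumes the standard DPO objective, with per-token log-probability ratios summed only over unmasked tokens. The function names, the token-level mask layout, and the `beta` value are hypothetical choices for illustration and are not taken from the paper. In the preferred response, only tokens in factually correct sentences are kept; in the non-preferred response, only tokens in hallucinated sentences are kept, so factual content there is not penalized.

```python
import math
from typing import Sequence


def token_mask(sentence_lengths: Sequence[int], keep: Sequence[bool]) -> list:
    """Expand per-sentence keep flags into a per-token 0/1 mask (hypothetical helper)."""
    mask = []
    for length, flag in zip(sentence_lengths, keep):
        mask.extend([1.0 if flag else 0.0] * length)
    return mask


def masked_logratio(policy_logps, ref_logps, mask) -> float:
    """Sum log(pi / pi_ref) over the tokens whose mask value is 1."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mask_dpo_loss(chosen_policy, chosen_ref, chosen_mask,
                  rejected_policy, rejected_ref, rejected_mask,
                  beta: float = 0.1) -> float:
    """DPO loss, -log(sigmoid(margin)), computed on masked log-ratios (a sketch)."""
    chosen = masked_logratio(chosen_policy, chosen_ref, chosen_mask)
    rejected = masked_logratio(rejected_policy, rejected_ref, rejected_mask)
    margin = beta * (chosen - rejected)
    return softplus(-margin)


# Toy example: each response has two one-token sentences.
# Preferred response: keep the factual first sentence, drop the hallucinated second.
chosen_mask = token_mask([1, 1], keep=[True, False])
# Non-preferred response: keep only the hallucinated second sentence for the penalty.
rejected_mask = token_mask([1, 1], keep=[False, True])

loss = mask_dpo_loss(
    chosen_policy=[-1.0, -2.0], chosen_ref=[-1.5, -2.0], chosen_mask=chosen_mask,
    rejected_policy=[-1.0, -3.0], rejected_ref=[-1.0, -2.0], rejected_mask=rejected_mask,
    beta=1.0,
)
print(loss)
```

With every mask entry set to 1, this sketch reduces to the usual response-level DPO loss, which is the setting the abstract identifies as a source of training noise.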
