Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

Abstract

Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Because hallucinations always appear alongside truthful content in LLM responses, previous factuality alignment methods that conduct response-level preference learning inevitably introduce noise during training. This paper therefore proposes Mask-DPO, a fine-grained factuality alignment method based on Direct Preference Optimization (DPO). Mask-DPO incorporates sentence-level factuality as mask signals, so that it learns only from factually correct sentences in the preferred samples and avoids penalizing factual content in the dispreferred samples, which resolves the ambiguity in preference learning. Extensive experimental results demonstrate that Mask-DPO significantly improves the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, even though these questions and their corresponding topics are unseen during training. Trained only on the ANAH train set, Llama3.1-8B-Instruct improves its score on the ANAH test set from 49.19% to 77.53%, surpassing even Llama3.1-70B-Instruct (53.44%), and its FactScore on the out-of-domain Biography dataset improves from 30.29% to 39.39%. We further study the generalization property of Mask-DPO under different strategies for scaling the training samples and find that scaling the number of topics in the dataset is more effective than scaling the number of questions. We propose a hypothesis about what factuality alignment does to LLMs, discuss the implications of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and the findings pave the way for future research on scaling factuality alignment.
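
The masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes per-token log-probabilities from the policy and a frozen reference model are already available, that sentence-level factuality labels are expanded to token masks, and that the mask is applied by summing the log-ratio only over unmasked tokens. The function names (`token_mask`, `masked_log_ratio`, `mask_dpo_loss`) and the default `beta` are hypothetical.

```python
import math

def token_mask(sentence_lengths, sentence_is_factual, keep_factual):
    """Expand sentence-level factuality labels to a per-token 0/1 mask.

    keep_factual=True keeps tokens of factual sentences (preferred sample);
    keep_factual=False keeps tokens of non-factual sentences (dispreferred sample).
    """
    mask = []
    for n_tokens, is_factual in zip(sentence_lengths, sentence_is_factual):
        mask.extend([1.0 if is_factual == keep_factual else 0.0] * n_tokens)
    return mask

def masked_log_ratio(policy_logps, ref_logps, mask):
    """Sum of (policy - reference) token log-probs over unmasked tokens only."""
    return sum(m * (p - r) for p, r, m in zip(policy_logps, ref_logps, mask))

def mask_dpo_loss(pol_w, ref_w, mask_w, pol_l, ref_l, mask_l, beta=0.1):
    """DPO-style loss -log(sigmoid(margin)) with masked log-ratios (assumed form)."""
    margin = beta * (
        masked_log_ratio(pol_w, ref_w, mask_w)
        - masked_log_ratio(pol_l, ref_l, mask_l)
    )
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Preferred response: sentence 1 (2 tokens) factual, sentence 2 (1 token) not.
mask_w = token_mask([2, 1], [True, False], keep_factual=True)
# Dispreferred response: sentence 1 (1 token) factual, sentence 2 (2 tokens) not.
mask_l = token_mask([1, 2], [True, False], keep_factual=False)

ref = [-1.0, -1.0, -1.0]
loss = mask_dpo_loss([-0.5, -1.0, -1.0], ref, mask_w, ref, ref, mask_l)
```

Under these assumptions, a hallucinated sentence inside the preferred response contributes nothing to the reward margin, and a factual sentence inside the dispreferred response is not pushed down, which is the ambiguity that response-level preference learning leaves unresolved.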
