Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection
May 23, 2025
Authors: Shrey Pandit, Ashwin Vinod, Liu Leqi, Ying Ding
cs.AI
Abstract
Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified by the greatest reduction in probability scores from independent fact-checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with the curriculum DPO approach and high-quality negative samples, significantly improve performance across various metrics, achieving gains of up to 24% on difficult benchmarks such as MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.
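The abstract describes two mechanisms: scoring each synthetic hallucinated negative by how much it lowers an independent fact-checker's probability relative to the faithful answer, and ordering the resulting DPO preference pairs from easiest to hardest. The sketch below illustrates that pipeline under stated assumptions; it is not the authors' released code, and the names `fact_check_prob`, `DPOPair`, `build_curriculum`, and `num_stages` are hypothetical placeholders for whatever fact-checking model and trainer interface are actually used.

```python
# Minimal sketch of curriculum ordering for DPO with synthetic hallucinated
# negatives. Each pair's difficulty is the drop in an external fact-checker's
# probability when the faithful answer is replaced by the hallucinated one:
# a large drop means the hallucination is easy to detect, so those pairs are
# scheduled first and the most deceptive (smallest-drop) pairs come last.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DPOPair:
    prompt: str
    chosen: str        # faithful answer (preferred response)
    rejected: str      # synthetic hallucinated answer (dispreferred response)
    difficulty: float  # probability drop; smaller drop => harder example


def build_curriculum(
    examples: List[dict],
    fact_check_prob: Callable[[str, str], float],  # hypothetical: P(answer supported | prompt)
    num_stages: int = 3,
) -> List[List[DPOPair]]:
    """Score preference pairs with an independent fact checker and split them
    into easy-to-hard stages for curriculum DPO training."""
    pairs = []
    for ex in examples:
        p_true = fact_check_prob(ex["prompt"], ex["chosen"])
        p_hall = fact_check_prob(ex["prompt"], ex["rejected"])
        drop = p_true - p_hall  # greatest reduction => easiest sample
        pairs.append(DPOPair(ex["prompt"], ex["chosen"], ex["rejected"], drop))

    # Easiest first: largest reduction in fact-checker probability.
    pairs.sort(key=lambda p: p.difficulty, reverse=True)

    stage_size = max(1, len(pairs) // num_stages)
    return [pairs[i:i + stage_size] for i in range(0, len(pairs), stage_size)]
```

Each stage would then be passed, in order, to a standard DPO trainer as successive training phases, so the model first learns from obviously hallucinated negatives before facing the most deceptive ones.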