Why Fine-Tuning Exacerbates Model Hallucinations and How to Mitigate Them

Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations with respect to knowledge acquired during pre-training. Because these hallucinations arise as a by-product of knowledge degradation during training, we explore whether they can be mitigated using established tools from the continual learning literature. We propose a self-distillation-based SFT method that regularizes output-distribution drift, enabling effective factual learning while minimizing hallucinations with respect to pre-existing knowledge. We also show that, in settings where acquiring new knowledge is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
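
The abstract does not state the training objective, so the following is only a minimal sketch of what regularizing output-distribution drift could look like for a single token position. It assumes a standard cross-entropy SFT term plus a KL penalty that pulls the fine-tuned model's next-token distribution toward that of a frozen copy of the pre-fine-tuning model. The function names, the KL direction, and the `drift_weight` parameter are illustrative assumptions, not the paper's formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(ref_log_probs, student_log_probs):
    """KL(reference || student), computed from log-probabilities."""
    return sum(
        math.exp(lr) * (lr - ls)
        for lr, ls in zip(ref_log_probs, student_log_probs)
    )


def self_distillation_sft_loss(student_logits, reference_logits,
                               target_index, drift_weight=0.5):
    """Hypothetical per-token loss: SFT cross-entropy plus a drift penalty.

    student_logits   -- logits of the model being fine-tuned
    reference_logits -- logits of a frozen pre-fine-tuning copy (assumed)
    target_index     -- index of the ground-truth next token
    drift_weight     -- assumed hyperparameter trading off the two terms
    """
    student_lp = log_softmax(student_logits)
    reference_lp = log_softmax(reference_logits)
    task_loss = -student_lp[target_index]                 # cross-entropy on the new fact
    drift_penalty = kl_divergence(reference_lp, student_lp)  # output-distribution drift
    return task_loss + drift_weight * drift_penalty


if __name__ == "__main__":
    # Toy 3-token vocabulary; the student has moved away from the reference.
    loss = self_distillation_sft_loss([3.0, 0.0, 0.0], [1.0, 2.0, 0.5], target_index=0)
    print(f"regularized loss: {loss:.4f}")
```

With `drift_weight=0.0` this reduces to plain SFT cross-entropy; larger values penalize the student more heavily for moving its output distribution away from the frozen reference, which is the intuition behind limiting drift on pre-existing knowledge.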