Unilogit: Robust Machine Unlearning for LLMs Using Uniform-Target Self-Distillation
May 9, 2025
Authors: Stefan Vasilev, Christian Herold, Baohao Liao, Seyyed Hadi Hashemi, Shahram Khadivi, Christof Monz
cs.AI
Abstract
This paper introduces Unilogit, a novel self-distillation method for machine
unlearning in Large Language Models. Unilogit addresses the challenge of
selectively forgetting specific information while maintaining overall model
utility, a critical task in compliance with data privacy regulations like GDPR.
Unlike prior methods that rely on static hyperparameters or the starting
model's outputs, Unilogit dynamically adjusts target logits to achieve a
uniform probability for the target token, leveraging the current model's
outputs for
more accurate self-distillation targets. This approach not only eliminates the
need for additional hyperparameters but also enhances the model's ability to
approximate the golden targets. Extensive experiments on public benchmarks and
an in-house e-commerce dataset demonstrate Unilogit's superior performance in
balancing forget and retain objectives, outperforming state-of-the-art methods
such as NPO and UnDIAL. Our analysis further reveals Unilogit's robustness
across various scenarios, highlighting its practical applicability and
effectiveness for machine unlearning.
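
To make the core idea concrete, here is a minimal PyTorch sketch of how a
uniform-target self-distillation objective of this kind could look. Everything
below is an illustrative assumption reconstructed from the abstract, not the
authors' reference code: the function names (`unilogit_targets`,
`unilogit_forget_loss`), the choice to redistribute the leftover probability
mass proportionally to the current model's distribution, and the KL-based loss
are all hypothetical details consistent with, but not confirmed by, the text.

```python
import torch
import torch.nn.functional as F


def unilogit_targets(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Build soft targets where the ground-truth token's probability is pushed
    down to the uniform value 1/|V|, with the remaining mass redistributed over
    the other tokens in proportion to the current model's own distribution.

    logits:     (batch, seq, vocab) logits from the *current* model
    target_ids: (batch, seq) ids of the tokens to be forgotten
    """
    with torch.no_grad():  # targets are constants for the distillation loss
        vocab_size = logits.size(-1)
        probs = F.softmax(logits, dim=-1)
        # Current probability assigned to the token we want to forget.
        p_target = probs.gather(-1, target_ids.unsqueeze(-1))
        # Rescale the non-target tokens so that, once the target token is
        # pinned at 1/|V|, the distribution still sums to one.
        scale = (1.0 - 1.0 / vocab_size) / (1.0 - p_target).clamp_min(1e-12)
        targets = probs * scale
        targets.scatter_(-1, target_ids.unsqueeze(-1), 1.0 / vocab_size)
    return targets


def unilogit_forget_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """KL(targets || model) on the forget set; in practice this would be
    combined with a retain-set objective to preserve overall utility."""
    targets = unilogit_targets(logits, target_ids)
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, targets, reduction="batchmean")
```

Two aspects of this sketch mirror the claims in the abstract: pinning the
target token at exactly 1/|V| removes the need for a tuned adjustment-strength
hyperparameter, and recomputing the targets from the current model at each step
(rather than from the starting checkpoint) is what yields the more accurate
self-distillation targets the paper describes.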