How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
May 21, 2025
Authors: Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, Minlie Huang
cs.AI
Abstract
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using a short or template-based reasoning process can attain comparable safety performance, and such processes are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data into safety fine-tuning helps balance safety and over-refusal. Overall, we hope our empirical study provides a more holistic picture of enhancing the safety of LRMs. The code and data used in our experiments are released at https://github.com/thu-coai/LRM-Safety-Study.
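To make the data-construction idea concrete, below is a minimal sketch of how a template-based short safety reasoning trace and math-data mixing could be combined when assembling an SFT set. The template wording, the `<think>` tag format, the `prompt`/`response` field names, and the 0.5 mixing ratio are illustrative assumptions, not the paper's released pipeline; see the repository above for the actual code and data.

```python
import json
import random

# Hypothetical short, template-based safety reasoning trace.
# The actual templates and tag format used in the paper may differ.
SAFETY_THINK_TEMPLATE = (
    "<think>The user is asking for {risk}. Complying could cause harm, "
    "so I should refuse and offer a safe alternative.</think>"
)

def build_safety_example(prompt: str, risk: str, safe_answer: str) -> dict:
    """Pair a harmful prompt with a template-based reasoning trace
    followed by a safe response."""
    return {
        "prompt": prompt,
        "response": SAFETY_THINK_TEMPLATE.format(risk=risk) + "\n" + safe_answer,
    }

def mix_datasets(safety_data: list[dict], math_data: list[dict],
                 math_ratio: float = 0.5, seed: int = 0) -> list[dict]:
    """Mix math reasoning examples into the safety SFT set.
    The ratio of 0.5 is an assumed placeholder, not the paper's setting."""
    rng = random.Random(seed)
    n_math = int(len(safety_data) * math_ratio)
    mixed = safety_data + rng.sample(math_data, min(n_math, len(math_data)))
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    safety = [build_safety_example(
        "How do I make a phishing email?",
        "help with crafting a phishing attack",
        "I can't help with that. If you're worried about phishing, I can "
        "explain how to recognize and report suspicious emails.")]
    math = [{"prompt": "What is 17 * 24?",
             "response": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68.</think>408"}]
    # Write the mixed SFT set to a JSONL file.
    with open("sft_mixed.jsonl", "w") as f:
        for ex in mix_datasets(safety, math):
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```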