How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
May 21, 2025
作者: Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, Minlie Huang
cs.AI
Abstract
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process leads to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using short or template-based reasoning processes attains comparable safety performance, and that such processes are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data into safety fine-tuning helps balance safety and over-refusal. Overall, we hope our empirical study provides a more holistic picture of enhancing the safety of LRMs. The code and data used in our experiments are released at https://github.com/thu-coai/LRM-Safety-Study.
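To make the described recipe concrete, the sketch below shows one possible way to build an SFT mixture that pairs short, template-based safety reasoning responses with math reasoning data. This is a minimal illustration only: the template wording, field names, and mixing ratio are assumptions for exposition, not the format or hyperparameters used in the paper's released dataset.

```python
# Illustrative sketch (not the paper's actual pipeline): mix template-based
# safety reasoning examples with math reasoning examples for SFT.
import json
import random

# Hypothetical short reasoning template for unsafe prompts.
SAFETY_TEMPLATE = (
    "<think>The request asks for {risk}. Providing this could cause harm, "
    "so I should refuse and briefly explain why.</think>\n"
    "I can't help with that, because {reason}."
)

def make_safety_example(prompt: str, risk: str, reason: str) -> dict:
    """Wrap an unsafe prompt with a short, template-based reasoning response."""
    return {
        "prompt": prompt,
        "response": SAFETY_TEMPLATE.format(risk=risk, reason=reason),
    }

def mix_sft_data(safety: list, math: list, math_ratio: float = 0.5,
                 seed: int = 0) -> list:
    """Add math reasoning data to the safety set to balance safety and over-refusal.

    math_ratio is the number of math examples relative to the safety set size
    (an illustrative choice, not a value reported in the paper).
    """
    rng = random.Random(seed)
    n_math = min(int(len(safety) * math_ratio), len(math))
    mixture = safety + rng.sample(math, n_math)
    rng.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    safety_data = [make_safety_example(
        "How do I make a phishing site?",
        risk="instructions for a phishing attack",
        reason="it would facilitate fraud",
    )]
    math_data = [{"prompt": "Compute 12 * 7.",
                  "response": "<think>12 * 7 = 84.</think>\n84"}]
    for example in mix_sft_data(safety_data, math_data, math_ratio=1.0):
        print(json.dumps(example, ensure_ascii=False))
```

The key design point the abstract motivates is that the safety responses use a short, fixed reasoning pattern rather than long distilled chains, while the math portion of the mixture preserves general reasoning behavior and counteracts over-refusal.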