大型推理模型的隱藏風險:R1模型的安全性評估
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
February 18, 2025
作者: Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang
cs.AI
摘要
大型推理模型的快速發展,例如OpenAI-o3和DeepSeek-R1,已顯著提升了在複雜推理任務上相較於非推理大型語言模型(LLMs)的表現。然而,這些模型增強的能力,加上如DeepSeek-R1等模型的開源特性,引發了嚴重的安全顧慮,尤其是關於其可能被濫用的風險。在本研究中,我們對這些推理模型進行了全面的安全評估,利用既有的安全基準來檢驗其是否符合安全規範。此外,我們探討了它們對抗性攻擊(如越獄和提示注入)的易感性,以評估其在實際應用中的穩健性。通過多方面的分析,我們揭示了四項關鍵發現:(1)開源R1模型與o3-mini模型在安全基準和攻擊測試上存在顯著的安全差距,表明R1模型需要更多的安全改進。(2)蒸餾後的推理模型在安全性能上遜色於其經過安全對齊的基礎模型。(3)模型的推理能力越強,回答不安全問題時可能造成的危害越大。(4)R1模型的思考過程比其最終答案帶來更大的安全隱患。我們的研究為推理模型的安全影響提供了洞見,並強調了進一步提升R1模型安全性的必要性,以縮小這一差距。
English
The rapid development of large reasoning models, such as OpenAI-o3 and
DeepSeek-R1, has led to significant improvements in complex reasoning over
non-reasoning large language models~(LLMs). However, their enhanced
capabilities, combined with the open-source access of models like DeepSeek-R1,
raise serious safety concerns, particularly regarding their potential for
misuse. In this work, we present a comprehensive safety assessment of these
reasoning models, leveraging established safety benchmarks to evaluate their
compliance with safety regulations. Furthermore, we investigate their
susceptibility to adversarial attacks, such as jailbreaking and prompt
injection, to assess their robustness in real-world applications. Through our
multi-faceted analysis, we uncover four key findings: (1) There is a
significant safety gap between the open-source R1 models and the o3-mini model,
on both safety benchmark and attack, suggesting more safety effort on R1 is
needed. (2) The distilled reasoning model shows poorer safety performance
compared to its safety-aligned base models. (3) The stronger the model's
reasoning ability, the greater the potential harm it may cause when answering
unsafe questions. (4) The thinking process in R1 models pose greater safety
concerns than their final answers. Our study provides insights into the
security implications of reasoning models and highlights the need for further
advancements in R1 models' safety to close the gap.Summary
AI-Generated Summary