Bag of Tricks for Subverting Reasoning-based Safety Guardrails

October 13, 2025
Authors: Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, Jindong Gu
cs.AI

Abstract

Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails have the model assess the safety of user inputs before generating a final response. This reasoning ability allows the model to analyze the intent of an input query and refuse to assist once it detects harmful intent hidden by a jailbreak method. Such guardrails deliver a significant boost in defense, e.g., near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these powerful reasoning-based guardrails can be extremely vulnerable to subtle manipulation of the input prompts and, once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can bypass the seemingly powerful guardrails and elicit explicit, harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. In addition to being scalable to implement, these methods achieve alarmingly high attack success rates (e.g., exceeding 90% across five different benchmarks on the gpt-oss series, for both locally hosted models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-source LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.