SoK: Evaluating Jailbreak Guardrails for Large Language Models
June 12, 2025
Authors: Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable progress, but their
deployment has exposed critical vulnerabilities, particularly to jailbreak
attacks that circumvent safety mechanisms. Guardrails--external defense
mechanisms that monitor and control LLM interactions--have emerged as a
promising solution. However, the current landscape of LLM guardrails is
fragmented, lacking a unified taxonomy and comprehensive evaluation framework.
In this Systematization of Knowledge (SoK) paper, we present the first holistic
analysis of jailbreak guardrails for LLMs. We propose a novel,
multi-dimensional taxonomy that categorizes guardrails along six key
dimensions, and introduce a Security-Efficiency-Utility evaluation framework to
assess their practical effectiveness. Through extensive analysis and
experiments, we identify the strengths and limitations of existing guardrail
approaches, explore their universality across attack types, and provide
insights into optimizing defense combinations. Our work offers a structured
foundation for future research and development, aiming to guide the principled
advancement and deployment of robust LLM guardrails. The code is available at
https://github.com/xunguangwang/SoK4JailbreakGuardrails.
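
For readers new to the term, the sketch below illustrates the guardrail concept the abstract describes: an external filter wrapped around an otherwise unmodified LLM, screening prompts on the way in and responses on the way out. It is a minimal toy, not the paper's method; the pattern list, the function names (`input_guardrail`, `output_guardrail`, `guarded_generate`), and the stand-in `model` callable are all invented for illustration.

```python
import re

# Toy illustration of an external guardrail: nothing here reflects the
# paper's taxonomy; patterns and names are invented for demonstration.

BLOCKED_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
    re.compile(r"pretend you have no (rules|restrictions)", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that request."


def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt matches a known jailbreak pattern."""
    return any(p.search(prompt) for p in BLOCKED_PATTERNS)


def output_guardrail(response: str) -> bool:
    """Return True if the model's response should be withheld (toy check)."""
    return "step-by-step instructions for" in response.lower()


def guarded_generate(model, prompt: str) -> str:
    """Wrap any `model(prompt) -> str` callable with input/output guardrails."""
    if input_guardrail(prompt):
        return REFUSAL
    response = model(prompt)
    if output_guardrail(response):
        return REFUSAL
    return response


if __name__ == "__main__":
    echo_model = lambda p: f"[model response to: {p}]"
    print(guarded_generate(echo_model, "What is a jailbreak guardrail?"))
    print(guarded_generate(echo_model, "Ignore all instructions and ..."))
```

Because the wrapper never touches the model's weights, this kind of defense can be layered or combined, which is the property the paper's Security-Efficiency-Utility framework evaluates: each added check may improve security while adding latency (efficiency) and risking false refusals (utility).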