SoK: Evaluating Jailbreak Guardrails for Large Language Models
June 12, 2025
Authors: Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable progress, but their
deployment has exposed critical vulnerabilities, particularly to jailbreak
attacks that circumvent safety mechanisms. Guardrails--external defense
mechanisms that monitor and control LLM interaction--have emerged as a
promising solution. However, the current landscape of LLM guardrails is
fragmented, lacking a unified taxonomy and comprehensive evaluation framework.
In this Systematization of Knowledge (SoK) paper, we present the first holistic
analysis of jailbreak guardrails for LLMs. We propose a novel,
multi-dimensional taxonomy that categorizes guardrails along six key
dimensions, and introduce a Security-Efficiency-Utility evaluation framework to
assess their practical effectiveness. Through extensive analysis and
experiments, we identify the strengths and limitations of existing guardrail
approaches, explore their universality across attack types, and provide
insights into optimizing defense combinations. Our work offers a structured
foundation for future research and development, aiming to guide the principled
advancement and deployment of robust LLM guardrails. The code is available at
https://github.com/xunguangwang/SoK4JailbreakGuardrails.
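As a rough illustration of how the Security-Efficiency-Utility evaluation framework mentioned above might be operationalized, the following Python sketch scores a guardrail along the three axes and aggregates them into a single number. The field names, metrics, weights, and example values are assumptions made here for illustration; they are not taken from the paper or its released code.

```python
from dataclasses import dataclass

# Hypothetical sketch of a Security-Efficiency-Utility evaluation.
# All names and weights are illustrative assumptions, not the authors' API.

@dataclass
class GuardrailEval:
    security: float    # e.g., fraction of jailbreak attempts blocked (higher is better)
    efficiency: float  # e.g., normalized added latency/compute overhead in [0, 1] (lower is better)
    utility: float     # e.g., benign-request pass rate (higher is better)

def aggregate(e: GuardrailEval,
              w_sec: float = 0.5, w_eff: float = 0.2, w_util: float = 0.3) -> float:
    """Combine the three axes into one score; the linear weighting is an assumption."""
    return w_sec * e.security + w_eff * (1.0 - e.efficiency) + w_util * e.utility

if __name__ == "__main__":
    # Compare two hypothetical guardrails under the same weighting.
    input_filter = GuardrailEval(security=0.92, efficiency=0.05, utility=0.88)
    output_filter = GuardrailEval(security=0.85, efficiency=0.15, utility=0.95)
    for name, e in [("input filter", input_filter), ("output filter", output_filter)]:
        print(f"{name}: {aggregate(e):.3f}")
```

A scalar aggregate like this is only one possible way to compare guardrails; the paper's framework may instead report the three dimensions separately or trade them off per deployment scenario.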