SoK: Evaluating Jailbreak Guardrails for Large Language Models

Abstract

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails, external defense mechanisms that monitor and control LLM interactions, have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented and lacks both a unified taxonomy and a comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and we introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.
