SoK: Evaluating Jailbreak Guardrails for Large Language Models

Abstract

Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails--external defense mechanisms that monitor and control LLM interaction--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.
