A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses

Abstract

Large Language Models (LLMs) are vulnerable to jailbreaks: methods to elicit harmful or generally impermissible outputs. Safety measures are developed and assessed on their effectiveness at defending against jailbreak attacks, indicating a belief that safety is equivalent to robustness. We assert that current defense mechanisms, such as output filters and alignment fine-tuning, are, and will remain, fundamentally insufficient for ensuring model safety. These defenses fail to address risks arising from dual-intent queries and from the ability to compose innocuous outputs to achieve harmful goals. To address this critical gap, we introduce an information-theoretic threat model called inferential adversaries, who exploit impermissible information leakage from model outputs to achieve malicious goals. We distinguish these from commonly studied security adversaries, who only seek to force victim models to generate specific impermissible outputs. We demonstrate the feasibility of automating inferential adversaries through question decomposition and response aggregation. To provide safety guarantees, we define an information censorship criterion for censorship mechanisms, bounding the leakage of impermissible information. We propose a defense mechanism that ensures this bound and reveal an intrinsic safety-utility trade-off. Our work provides the first theoretically grounded understanding of the requirements for releasing safe LLMs and the utility costs involved.
