Imperceptible Jailbreaking against Large Language Models

Abstract

Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
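
The premise that a prompt can look unchanged on screen while its underlying character sequence differs can be illustrated with a minimal sketch. It is not the authors' chain-of-search pipeline and it does not call any tokenizer; it only shows that appended variation selectors change a string's code points and UTF-8 bytes, and how such characters might be detected or stripped before text reaches a model. The helper names, the benign sample sentence, and the three-character suffix are assumptions made for illustration. The code point ranges are the two variation selector blocks defined by Unicode (U+FE00 to U+FE0F and U+E0100 to U+E01EF).

```python
# Illustrative sketch only: helper names and sample text are hypothetical.
# Unicode defines 256 variation selectors in two blocks:
#   VS1-VS16:   U+FE00  .. U+FE0F
#   VS17-VS256: U+E0100 .. U+E01EF
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def is_variation_selector(ch: str) -> bool:
    """Return True if the single character ch is a variation selector."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def count_variation_selectors(text: str) -> int:
    """Count variation selectors anywhere in text."""
    return sum(is_variation_selector(ch) for ch in text)


def strip_variation_selectors(text: str) -> str:
    """Return text with every variation selector removed."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# A benign sample sentence and an arbitrary suffix of three selectors.
visible = "What is the capital of France?"
suffix = "\ufe01\U000e0100\U000e01ef"
padded = visible + suffix

# The two strings typically render identically, yet they differ in
# code point count and in UTF-8 byte length.
print("code points:", len(visible), "vs", len(padded))
print("utf-8 bytes:", len(visible.encode("utf-8")), "vs", len(padded.encode("utf-8")))
print("selectors found:", count_variation_selectors(padded))
print("restored:", strip_variation_selectors(padded) == visible)
```

Because a tokenizer operates on the underlying code points or bytes rather than on rendered glyphs, the padded string would generally be tokenized differently from the visible one. Note that stripping every variation selector is a blunt filter: U+FE0F, for example, legitimately selects emoji presentation, so a real input sanitizer may prefer to flag unusual runs of selectors rather than remove all of them.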
