Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Abstract

Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed for maliciousness and cleaned. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), whose accuracy is inconsistent across types of harmful content. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
