AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Abstract

While Large Language Models (LLMs) have recently achieved remarkable successes, they are vulnerable to certain jailbreaking attacks that lead to the generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g. by appending a suffix to a given instruction, which is inefficient and time-consuming. On the other hand, automatic adversarial prompt generation often leads to semantically meaningless attacks that can easily be detected by perplexity-based filters, may require gradient information from the TargetLLM, or does not scale well due to time-consuming discrete optimization processes over the token space. In this paper, we present a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, ~800× faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) low-rank fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open-source TargetLLMs show state-of-the-art results on the AdvBench dataset, which also transfer to closed-source black-box LLM APIs. Further, we demonstrate that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.
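To make the remark about perplexity-based filters concrete, the following is a minimal sketch of how such a filter can flag semantically meaningless text. It is not the filter used in the paper: a toy add-one-smoothed bigram model stands in for a real language model, and the names `train_bigram`, `perplexity`, `is_flagged`, the reference corpus, and the threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams over whitespace tokens (toy stand-in for a real LM)."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    # +1 reserves probability mass for tokens never seen in the corpus.
    return context_counts, bigram_counts, len(vocab) + 1


def perplexity(text, model):
    """exp of the average negative log-probability per token."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + text.lower().split()
    if len(tokens) < 2:
        raise ValueError("text must contain at least one token")
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing keeps unseen bigrams at non-zero probability.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


def is_flagged(text, model, threshold):
    """Flag text whose perplexity exceeds an (assumed, hand-picked) threshold."""
    return perplexity(text, model) > threshold


if __name__ == "__main__":
    # Tiny illustrative reference corpus; a real filter would use a trained LM.
    corpus = [
        "please summarize this article for me",
        "please write a short story for me",
        "write a summary of this article",
    ]
    model = train_bigram(corpus)
    for text in ["please write a summary of this article", "xq zzv ]] ;; plk"]:
        print(f"{text!r}: perplexity={perplexity(text, model):.2f}")
```

Under these assumptions, token sequences unlike the reference text receive a much higher perplexity than fluent text, which is why meaningless suffixes are easy to detect and why human-readable adversarial prompts are harder to filter this way.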

