Imperceptible Jailbreaking against Large Language Models
October 6, 2025
Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
cs.AI
Abstract
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
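
To illustrate the mechanism the abstract describes, the following is a minimal Python sketch showing how Unicode variation selectors (U+FE00 through U+FE0F, plus the supplement U+E0100 through U+E01EF) can be appended to a prompt without changing how it renders on screen. This is not the authors' chain-of-search pipeline; the function name `append_invisible_suffix` and the example indices are hypothetical, and no particular tokenizer is assumed.

```python
# Minimal sketch: appending invisible Unicode variation selectors to a prompt.
# The rendered text looks unchanged, but the underlying character sequence
# (and hence its tokenization) differs. Benign example text only.

# 16 variation selectors in the BMP block plus 240 in the supplement block.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)] + \
                      [chr(cp) for cp in range(0xE0100, 0xE01F0)]

def append_invisible_suffix(prompt: str, selector_ids: list[int]) -> str:
    """Append a sequence of variation selectors chosen by index (hypothetical helper)."""
    suffix = "".join(VARIATION_SELECTORS[i] for i in selector_ids)
    return prompt + suffix

if __name__ == "__main__":
    base = "How do I bake bread?"                 # benign stand-in question
    adv = append_invisible_suffix(base, [3, 17, 250, 42])

    print(base)                                   # these two lines render identically
    print(adv)
    print(len(base), len(adv))                    # lengths differ: the suffix is real
    print([hex(ord(c)) for c in adv[len(base):]]) # the appended code points
```

Running the sketch shows two visually identical lines whose lengths and code points differ; in the paper's setting, the choice of which selectors to append is what the search procedure optimizes to elicit harmful responses.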