Imperceptible Jailbreaking against Large Language Models
October 6, 2025
Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
cs.AI
Abstract
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
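
The core mechanism can be illustrated with a minimal sketch (not the authors' chain-of-search pipeline): Unicode variation selectors are zero-width codepoints with no glyphs of their own, so appending them leaves the rendered prompt unchanged while the underlying character sequence, and hence its tokenization by a subword tokenizer, differs. The function name, suffix length, and placeholder question below are illustrative assumptions.

```python
# Minimal sketch of appending invisible variation selectors to a prompt.
# This only demonstrates the Unicode mechanism; the paper's actual
# adversarial suffixes are found via a chain-of-search procedure.
import random

# Variation selectors: U+FE00..U+FE0F (VS1-VS16) and U+E0100..U+E01EF (VS17-VS256).
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)] + \
                      [chr(cp) for cp in range(0xE0100, 0xE01F0)]

def append_invisible_suffix(prompt: str, length: int = 8, seed: int = 0) -> str:
    """Append `length` randomly chosen variation selectors to `prompt`."""
    rng = random.Random(seed)
    suffix = "".join(rng.choice(VARIATION_SELECTORS) for _ in range(length))
    return prompt + suffix

if __name__ == "__main__":
    original = "How do I bake bread?"   # benign placeholder question
    modified = append_invisible_suffix(original)

    print(original)                      # renders identically to the original
    print(repr(modified))                # escaped form reveals the hidden suffix
    print(original == modified)          # False: the strings differ
    print(len(original), len(modified))  # lengths differ despite identical rendering
```

Because the appended codepoints survive copy-paste and are encoded as extra tokens by typical subword tokenizers, the model receives a different input sequence than what the user (or a human reviewer) sees on screen.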