Imperceptible Jailbreaking against Large Language Models
October 6, 2025
Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
cs.AI
Abstract
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
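
The core mechanism can be illustrated with a minimal sketch (not the authors' chain-of-search pipeline): Unicode variation selectors are zero-width codepoints with no glyphs of their own, so appending them leaves the rendered prompt unchanged while the underlying character sequence, and hence its tokenization by a subword tokenizer, differs. The function name, suffix length, and placeholder question below are illustrative assumptions.

```python
# Minimal sketch of appending invisible variation selectors to a prompt.
# This only demonstrates the Unicode mechanism; the paper's actual
# adversarial suffixes are found via a chain-of-search procedure.
import random

# Variation selectors: U+FE00..U+FE0F (VS1-VS16) and U+E0100..U+E01EF (VS17-VS256).
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)] + \
                      [chr(cp) for cp in range(0xE0100, 0xE01F0)]

def append_invisible_suffix(prompt: str, length: int = 8, seed: int = 0) -> str:
    """Append `length` randomly chosen variation selectors to `prompt`."""
    rng = random.Random(seed)
    suffix = "".join(rng.choice(VARIATION_SELECTORS) for _ in range(length))
    return prompt + suffix

if __name__ == "__main__":
    original = "How do I bake bread?"   # benign placeholder question
    modified = append_invisible_suffix(original)

    print(original)                      # renders identically to the original
    print(repr(modified))                # escaped form reveals the hidden suffix
    print(original == modified)          # False: the strings differ
    print(len(original), len(modified))  # lengths differ despite identical rendering
```

Because the appended codepoints survive copy-paste and are encoded as extra tokens by typical subword tokenizers, the model receives a different input sequence than what the user (or a human reviewer) sees on screen.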