Imperceptible Jailbreaking against Large Language Models
October 6, 2025
Authors: Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
cs.AI
Abstract
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.
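
To illustrate the mechanism the abstract describes, the following is a minimal Python sketch showing how Unicode variation selectors (U+FE00 through U+FE0F, plus the supplement U+E0100 through U+E01EF) can be appended to a prompt without changing how it renders on screen. This is not the authors' chain-of-search pipeline; the function name `append_invisible_suffix` and the example indices are hypothetical, and no particular tokenizer is assumed.

```python
# Minimal sketch: appending invisible Unicode variation selectors to a prompt.
# The rendered text looks unchanged, but the underlying character sequence
# (and hence its tokenization) differs. Benign example text only.

# 16 variation selectors in the BMP block plus 240 in the supplement block.
VARIATION_SELECTORS = [chr(cp) for cp in range(0xFE00, 0xFE10)] + \
                      [chr(cp) for cp in range(0xE0100, 0xE01F0)]

def append_invisible_suffix(prompt: str, selector_ids: list[int]) -> str:
    """Append a sequence of variation selectors chosen by index (hypothetical helper)."""
    suffix = "".join(VARIATION_SELECTORS[i] for i in selector_ids)
    return prompt + suffix

if __name__ == "__main__":
    base = "How do I bake bread?"                 # benign stand-in question
    adv = append_invisible_suffix(base, [3, 17, 250, 42])

    print(base)                                   # these two lines render identically
    print(adv)
    print(len(base), len(adv))                    # lengths differ: the suffix is real
    print([hex(ord(c)) for c in adv[len(base):]]) # the appended code points
```

Running the sketch shows two visually identical lines whose lengths and code points differ; in the paper's setting, the choice of which selectors to append is what the search procedure optimizes to elicit harmful responses.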