面具背後的魔鬼:擴散式大型語言模型浮現的安全漏洞
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
July 15, 2025
作者: Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
cs.AI
摘要
基於擴散的大型語言模型(dLLMs)近期作為自回歸LLMs的一種強大替代方案嶄露頭角,其通過並行解碼與雙向建模提供了更快的推理速度與更高的互動性。然而,儘管在代碼生成與文本填充方面表現出色,我們發現了一個根本性的安全隱患:現有的對齊機制未能有效保護dLLMs免受上下文感知、掩碼輸入的對抗性提示攻擊,從而暴露了新的脆弱性。為此,我們提出了DIJA,這是首個系統性研究並利用dLLMs獨特安全弱點的越獄攻擊框架。具體而言,DIJA構建了交錯掩碼文本的對抗性提示,這些提示利用了dLLMs的文本生成機制,即雙向建模與並行解碼。雙向建模促使模型為掩碼部分生成上下文一致的輸出,即便這些輸出有害;而並行解碼則限制了模型對不安全內容的動態過濾與拒絕採樣。這導致標準對齊機制失效,使得在對齊調優的dLLMs中,即便提示中直接暴露了有害行為或不安全指令,仍能生成有害的完成內容。通過全面實驗,我們證明DIJA顯著優於現有的越獄方法,揭示了dLLM架構中一個先前被忽視的威脅面。值得注意的是,我們的方法在Dream-Instruct上實現了高達100%的基於關鍵字的ASR,在JailbreakBench上以評估者為基礎的ASR超越了先前最強基線ReNeLLM達78.5%,在StrongREJECT得分上提升了37.7分,且無需在越獄提示中重寫或隱藏有害內容。我們的研究結果強調了重新思考這一新興語言模型類別中安全對齊的迫切需求。代碼已公開於https://github.com/ZichenWen1/DIJA。
English
Diffusion-based large language models (dLLMs) have recently emerged as a
powerful alternative to autoregressive LLMs, offering faster inference and
greater interactivity via parallel decoding and bidirectional modeling.
However, despite strong performance in code generation and text infilling, we
identify a fundamental safety concern: existing alignment mechanisms fail to
safeguard dLLMs against context-aware, masked-input adversarial prompts,
exposing novel vulnerabilities. To this end, we present DIJA, the first
systematic study and jailbreak attack framework that exploits unique safety
weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial
interleaved mask-text prompts that exploit the text generation mechanisms of
dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional
modeling drives the model to produce contextually consistent outputs for masked
spans, even when harmful, while parallel decoding limits model dynamic
filtering and rejection sampling of unsafe content. This causes standard
alignment mechanisms to fail, enabling harmful completions in alignment-tuned
dLLMs, even when harmful behaviors or unsafe instructions are directly exposed
in the prompt. Through comprehensive experiments, we demonstrate that DIJA
significantly outperforms existing jailbreak methods, exposing a previously
overlooked threat surface in dLLM architectures. Notably, our method achieves
up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior
baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and
by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of
harmful content in the jailbreak prompt. Our findings underscore the urgent
need for rethinking safety alignment in this emerging class of language models.
Code is available at https://github.com/ZichenWen1/DIJA.