

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

July 15, 2025
Authors: Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
cs.AI

Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study of, and jailbreak attack framework for, the unique safety weaknesses of dLLMs. Specifically, DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when those spans are harmful, while parallel decoding limits the model's dynamic filtering and rejection sampling of unsafe content. As a result, standard alignment mechanisms fail, enabling harmful completions from alignment-tuned dLLMs even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need to rethink safety alignment for this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.
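
To make the prompt-construction idea concrete, below is a minimal, illustrative sketch of how an interleaved mask-text prompt of the kind the abstract describes might be assembled. The mask token `[MASK]`, the step template, and the `build_interleaved_prompt` helper are assumptions for illustration only; they do not reproduce the paper's actual templates or any specific dLLM's mask format (see the released code at the repository above for the authors' implementation).

```python
# Illustrative sketch only: interleave visible scaffolding text with masked spans
# that a diffusion LLM is asked to fill in. The "[MASK]" token and the step-by-step
# template are hypothetical choices, not the paper's exact format.

MASK = "[MASK]"  # placeholder token the dLLM is expected to complete


def build_interleaved_prompt(instruction: str, num_steps: int = 3, span_len: int = 8) -> str:
    """Build an interleaved mask-text prompt.

    The visible text fixes the surrounding context (the instruction and a
    step-by-step structure), while the masked spans are left for the model
    to complete during bidirectional, parallel decoding.
    """
    lines = [f"Task: {instruction}", "Answer with concrete steps:"]
    for i in range(1, num_steps + 1):
        masked_span = " ".join([MASK] * span_len)  # contiguous run of mask tokens
        lines.append(f"Step {i}: {masked_span}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A benign instruction is used here purely to show the prompt shape.
    print(build_interleaved_prompt("describe how the prompt template is laid out"))
```

The point the abstract makes is that the visible scaffolding constrains the surrounding context, so a bidirectional decoder filling the masked spans in parallel is pressured to produce contextually consistent completions rather than emit a refusal.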