仮面の裏に潜む悪魔：拡散モデルにおける新たな安全性の脆弱性

要旨

拡散ベースの大規模言語モデル（dLLM）は、最近、自己回帰型LLMの強力な代替として登場し、並列デコードと双方向モデリングにより、より高速な推論と高いインタラクティブ性を提供しています。しかし、コード生成やテキスト補完において優れた性能を発揮する一方で、根本的な安全性の問題が存在します。既存のアライメント機構は、文脈を意識したマスク入力の敵対的プロンプトに対してdLLMを保護することができず、新たな脆弱性を露呈しています。この問題に対処するため、我々はDIJAを提案します。DIJAは、dLLMの独特な安全性の弱点を利用した、初の体系的な研究およびジェイルブレイク攻撃フレームワークです。具体的には、DIJAは、dLLMのテキスト生成メカニズム、すなわち双方向モデリングと並列デコードを利用した、敵対的なマスクテキストプロンプトを構築します。双方向モデリングは、有害な場合でも、マスクされた範囲に対して文脈的に一貫した出力を生成するようモデルを駆動し、並列デコードは、モデルの動的フィルタリングや安全でないコンテンツの拒否サンプリングを制限します。これにより、標準的なアライメント機構が機能しなくなり、アライメント調整されたdLLMにおいて、プロンプトに有害な行動や安全でない指示が直接含まれている場合でも、有害な補完が可能になります。包括的な実験を通じて、DIJAが既存のジェイルブレイク手法を大幅に上回り、dLLMアーキテクチャにおけるこれまで見過ごされていた脅威の側面を明らかにすることを示します。特に、我々の手法は、Dream-InstructにおいてキーワードベースのASRで最大100%を達成し、JailbreakBenchにおいては、最も強力な既存のベースラインであるReNeLLMを、評価者ベースのASRで最大78.5%、StrongREJECTスコアで37.7ポイント上回りました。さらに、ジェイルブレイクプロンプトにおいて有害なコンテンツを書き換えたり隠したりする必要がありません。我々の研究結果は、この新興の言語モデルクラスにおける安全性アライメントの再考が急務であることを強調しています。コードはhttps://github.com/ZichenWen1/DIJAで公開されています。

English

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

仮面の裏に潜む悪魔：拡散モデルにおける新たな安全性の脆弱性

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

要旨

Support