D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Abstract

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure, providing a proxy for sample difficulty. Building on this analysis, we propose D^2-Monitor, a bi-level safety monitor for D-LLMs. D^2-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D^2-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
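The routing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the boundary of 0.5, the margin of 0.1, the hesitation threshold, and the mean-score rule used for the lightweight probe's base classification are all hypothetical choices made for illustration. The sketch assumes the lightweight probe has already produced one unsafe-probability score per denoising step.

```python
from typing import Callable, Sequence, Tuple


def count_hesitation_steps(
    scores: Sequence[float], boundary: float = 0.5, margin: float = 0.1
) -> int:
    """Count denoising steps whose probe score lies within `margin` of the boundary."""
    return sum(1 for s in scores if abs(s - boundary) <= margin)


def route(
    light_scores: Sequence[float],
    heavy_probe: Callable[[], bool],
    boundary: float = 0.5,
    margin: float = 0.1,
    max_hesitation: int = 2,
) -> Tuple[bool, str]:
    """Return (is_unsafe, which_probe_decided) for one sample.

    `light_scores` holds the lightweight probe's per-step scores (assumed input).
    `heavy_probe` is a hypothetical callable standing in for the more
    expressive probe; it is only invoked when hesitation is high.
    """
    if not light_scores:
        raise ValueError("light_scores must contain at least one step")

    hesitation = count_hesitation_steps(light_scores, boundary, margin)
    if hesitation > max_hesitation:
        # Many steps sat near the decision boundary: escalate to the heavy probe.
        return heavy_probe(), "heavy"

    # Assumed base rule: compare the mean score over the trajectory to the boundary.
    mean_score = sum(light_scores) / len(light_scores)
    return mean_score >= boundary, "light"
```

For example, a trajectory with scores `[0.9, 0.55, 0.45, 0.52, 0.1]` has three steps inside the margin, so with `max_hesitation=2` the sample is escalated, while a trajectory that stays far from the boundary is decided by the lightweight probe alone.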