D^2-Monitor: Dynamic Safety Monitoring for Diffusion Large Language Models via Hesitation-Aware Routing

Abstract

Although diffusion large language models (D-LLMs) have emerged as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, which exposes intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to fail. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure and thus serves as a proxy for sample difficulty. Building on this analysis, we propose D^2-Monitor, a bi-level safety monitor for D-LLMs. D^2-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D^2-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
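The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows one way the hesitation count could gate a heavier probe. Everything named here is a hypothetical choice made for illustration: the functions `count_hesitation_steps` and `route`, the signed-score convention (decision boundary at 0, positive means unsafe), taking the base decision from the final denoising step, passing the whole trajectory to the heavy probe, and the default values of `margin` and `max_hesitation`.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def count_hesitation_steps(scores: Sequence[float], margin: float) -> int:
    """Count steps whose probe score lies within `margin` of the boundary.

    Assumption: the probe emits a signed score and the boundary is at 0.
    """
    return sum(1 for s in scores if abs(s) < margin)


def route(
    trajectory: Sequence[Vector],
    light_probe: Callable[[Vector], float],
    heavy_probe: Callable[[Sequence[Vector]], float],
    margin: float = 0.5,        # hypothetical margin around the boundary
    max_hesitation: int = 2,    # hypothetical routing threshold
) -> tuple[bool, bool]:
    """Return (is_unsafe, escalated) for one denoising trajectory."""
    # The always-on light probe scores every intermediate hidden state.
    scores = [light_probe(h) for h in trajectory]
    hesitation = count_hesitation_steps(scores, margin)

    # Escalate only when the hesitation count exceeds the threshold.
    if hesitation > max_hesitation:
        return heavy_probe(trajectory) > 0.0, True

    # Otherwise keep the light probe's decision (here: the final step).
    return scores[-1] > 0.0, False


# Toy stand-ins for the two probes, for illustration only.
def toy_light(h: Vector) -> float:
    return h[0] - h[1]


def toy_heavy(traj: Sequence[Vector]) -> float:
    return sum(h[0] for h in traj) - sum(h[1] for h in traj)


confident = [(2.0, 0.0), (3.0, 0.5), (2.5, 0.0)]
wavering = [(0.1, 0.0), (0.0, 0.2), (0.3, 0.1), (1.0, 0.0)]

confident_result = route(confident, toy_light, toy_heavy)
wavering_result = route(wavering, toy_light, toy_heavy)
```

In this toy setup the `confident` trajectory never enters the margin, so only the light probe runs, while the `wavering` trajectory has three scores inside the margin and is passed to the heavy probe. In practice the margin and threshold would presumably be tuned on held-out data.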