D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
May 25, 2026
Authors: Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi
cs.AI
Abstract
Although diffusion large language models (D-LLMs) have emerged as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to fail. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure, providing a proxy for sample difficulty. Building on this analysis, we propose D^2-Monitor, a bi-level safety monitor for D-LLMs. D^2-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D^2-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
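The hesitation-aware routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the margin of 0.1, the hesitation threshold of 2, and the choice to take the base decision from the final denoising step are all assumptions made for illustration. The sketch assumes the lightweight probe emits one scalar logit per denoising step and that the decision boundary sits at probability 0.5.

```python
import math
from typing import Callable, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Map a probe logit to a probability of the 'unsafe' class."""
    return 1.0 / (1.0 + math.exp(-x))


def count_hesitation_steps(step_logits: Sequence[float], margin: float = 0.1) -> int:
    """Count denoising steps whose probe probability lies within
    `margin` of the 0.5 decision boundary (hypothetical definition)."""
    return sum(1 for z in step_logits if abs(sigmoid(z) - 0.5) < margin)


def route_and_classify(
    step_logits: Sequence[float],
    heavy_probe: Callable[[], bool],
    margin: float = 0.1,
    max_hesitation: int = 2,
) -> Tuple[bool, bool]:
    """Return (is_unsafe, escalated).

    The lightweight probe's per-step logits are always inspected.
    The heavier probe runs only when the hesitation count exceeds
    `max_hesitation`; otherwise the base decision is taken from the
    final step (an assumption of this sketch).
    """
    hesitation = count_hesitation_steps(step_logits, margin)
    if hesitation > max_hesitation:
        return heavy_probe(), True
    return sigmoid(step_logits[-1]) >= 0.5, False


# Hypothetical trajectories of lightweight-probe logits.
confident = [3.0, 2.5, 4.0, 3.5]           # far from the boundary at every step
hesitant = [0.1, -0.2, 0.05, 0.3, 2.0]     # four steps close to the boundary

confident_result = route_and_classify(confident, heavy_probe=lambda: True)
hesitant_result = route_and_classify(hesitant, heavy_probe=lambda: False)
```

In this sketch the confident trajectory is handled by the lightweight probe alone, while the hesitant one is escalated, so the heavier probe's cost is paid only for inputs the cheap probe is likely to misjudge.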