D²-Monitor: Dynamic Safety Monitoring for Diffusion Large Language Models Based on Hesitation-Aware Routing

Abstract

Although diffusion large language models (D-LLMs) have emerged as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Because lightweight probes are well suited to always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure and thus serves as a proxy for sample difficulty. Building on this analysis, we propose D²-Monitor, a bi-level safety monitor for D-LLMs. D²-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D²-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
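The routing idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear form of the lightweight probe, the probability-space margin around 0.5, the use of the final denoising step for base classification, and every name and default value here (`margin`, `max_hesitation`, `heavy_probe`) are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def light_probe_scores(hidden_states, w, b):
    """Score each denoising step with a linear probe (assumed form).

    hidden_states: array of shape (T, d), one pooled vector per step.
    Returns an array of shape (T,) with unsafe probabilities.
    """
    return sigmoid(hidden_states @ w + b)


def count_hesitation_steps(probs, margin=0.1):
    """Count steps whose score lies within `margin` of the 0.5 boundary."""
    return int(np.sum(np.abs(probs - 0.5) < margin))


def monitor(hidden_states, w, b, heavy_probe, margin=0.1, max_hesitation=2):
    """Bi-level monitoring sketch.

    The light probe always runs. The heavier probe (any callable that maps
    the trajectory to an unsafe probability) runs only when the hesitation
    count exceeds `max_hesitation`.
    Returns (is_unsafe, hesitation_count, routed_to_heavy).
    """
    probs = light_probe_scores(hidden_states, w, b)
    hesitation = count_hesitation_steps(probs, margin)
    if hesitation > max_hesitation:
        return bool(heavy_probe(hidden_states) >= 0.5), hesitation, True
    # Assumption: base classification uses the final denoising step.
    return bool(probs[-1] >= 0.5), hesitation, False


# Toy usage with 1-dimensional hidden states and a placeholder heavy probe.
w, b = np.array([1.0]), 0.0
confident = np.array([[3.0], [4.0], [5.0]])
wavering = np.array([[0.1], [-0.1], [0.05], [2.0]])
easy_result = monitor(confident, w, b, heavy_probe=lambda h: 0.9)
hard_result = monitor(wavering, w, b, heavy_probe=lambda h: 0.1)
```

In this toy run the confident trajectory never approaches the boundary, so only the light probe is used, while the wavering trajectory has three near-boundary steps and is routed to the heavier probe. In practice the margin and threshold would need to be tuned on held-out data.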