D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Abstract

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states that repeatedly fall within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure and serves as a proxy for sample difficulty. Building on this analysis, we propose D²-Monitor, a bi-level safety monitor for D-LLMs. D²-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D²-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
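
The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the lightweight probe emits one unsafe-probability per denoising step, that the base classification is read from the final step, and that the heavy probe is an opaque callable. The names `count_hesitation_steps`, `route`, `boundary`, `margin`, and `max_hesitation`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple


def count_hesitation_steps(
    scores: Sequence[float], boundary: float = 0.5, margin: float = 0.1
) -> int:
    """Count denoising steps whose light-probe score lies within
    `margin` of the decision boundary (assumed definition of hesitation)."""
    return sum(1 for s in scores if abs(s - boundary) <= margin)


def route(
    scores: Sequence[float],
    run_heavy_probe: Callable[[], bool],
    boundary: float = 0.5,
    margin: float = 0.1,
    max_hesitation: int = 2,
) -> Tuple[bool, str, int]:
    """Return (is_unsafe, probe_used, hesitation_count).

    The heavy probe is only invoked when the hesitation count exceeds
    `max_hesitation`; otherwise the light probe's final-step score decides
    (an assumption; other aggregations are possible).
    """
    hesitation = count_hesitation_steps(scores, boundary, margin)
    if hesitation > max_hesitation:
        return run_heavy_probe(), "heavy", hesitation
    return scores[-1] >= boundary, "light", hesitation


# Hypothetical trajectories of light-probe scores over denoising steps.
confident_trajectory = [0.05, 0.10, 0.08, 0.03]
hesitant_trajectory = [0.45, 0.55, 0.48, 0.52, 0.90]

# Stand-in for the more expressive probe; a real one would read hidden states.
heavy_probe_stub = lambda: True

easy_result = route(confident_trajectory, heavy_probe_stub)
hard_result = route(hesitant_trajectory, heavy_probe_stub)
print(easy_result)
print(hard_result)
```

In this sketch the first trajectory never approaches the boundary, so the light probe decides alone, while the second spends four steps near the boundary and is escalated. Passing the heavy probe as a zero-argument callable keeps its cost off the common path, which is the point of the dynamic routing.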