D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Abstract

Although diffusion large language models (D-LLMs) have emerged as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, which exposes intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure and thus serves as a proxy for sample difficulty. Building on this analysis, we propose D^2-Monitor, a bi-level safety monitor for D-LLMs. D^2-Monitor uses a lightweight probe as an always-on monitor that both estimates hesitation and performs the base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D^2-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
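
The hesitation-aware routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the probe architectures, the margin, the threshold, or which denoising step feeds the final classification, so everything below is an assumption made for illustration: the probe interface (a function returning a probability of "unsafe"), the decision boundary at 0.5, the names `count_hesitation_steps` and `monitor`, the default values of `margin` and `max_hesitation`, the use of the last denoising step for the verdict, and the toy probes.

```python
import math
from typing import Callable, Sequence, Tuple

HiddenState = Sequence[float]
# Hypothetical probe interface: maps a hidden state to P(unsafe) in [0, 1].
Probe = Callable[[HiddenState], float]


def count_hesitation_steps(scores: Sequence[float], margin: float) -> int:
    """Count steps whose probe score lies within `margin` of the boundary.

    Assumption: the decision boundary sits at a probability of 0.5.
    """
    return sum(1 for s in scores if abs(s - 0.5) <= margin)


def monitor(
    trajectory: Sequence[HiddenState],
    light_probe: Probe,
    heavy_probe: Probe,
    margin: float = 0.1,        # illustrative value, not from the paper
    max_hesitation: int = 2,    # illustrative value, not from the paper
) -> Tuple[bool, str, int]:
    """Return (is_unsafe, probe_used, hesitation_count) for one trajectory."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one hidden state")

    # The lightweight probe runs on every denoising step (always-on).
    light_scores = [light_probe(h) for h in trajectory]
    hesitation = count_hesitation_steps(light_scores, margin)

    # Escalate to the heavier probe only when hesitation exceeds the threshold.
    if hesitation > max_hesitation:
        # Assumption: the heavy probe reads the final hidden state.
        return heavy_probe(trajectory[-1]) >= 0.5, "heavy", hesitation

    # Otherwise keep the lightweight probe's verdict on the final step.
    return light_scores[-1] >= 0.5, "light", hesitation


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Toy stand-ins for trained probes, for demonstration only.
def toy_light_probe(h: HiddenState) -> float:
    return sigmoid(sum(h))


def toy_heavy_probe(h: HiddenState) -> float:
    return sigmoid(4.0 * sum(h))


if __name__ == "__main__":
    confident = [[3.0], [4.0], [5.0]]
    hesitant = [[0.1], [-0.2], [0.3], [0.05]]
    print(monitor(confident, toy_light_probe, toy_heavy_probe))
    print(monitor(hesitant, toy_light_probe, toy_heavy_probe))
```

In this sketch the heavier probe is evaluated only for trajectories that the lightweight probe flags as hesitant, which is the source of the efficiency the abstract attributes to dynamic routing; how the actual system aggregates steps or trains either probe is not stated in the abstract.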