Quickest Detection of Hallucination Onset: Delay Bounds and a Learned CUSUM Statistic

Abstract

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. At a matched false-alarm rate it detects in 11 to 13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
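
For readers unfamiliar with the sequential-analysis vocabulary used above, the following is a minimal sketch of a textbook CUSUM stopping rule over per-token increments, together with the usual first-order approximation of Lorden's bound, delay ≈ log(1/α)/D. It is not the paper's implementation. The names `cusum_alarm` and `lorden_delay`, the threshold, the toy increments, and the divergence values are all hypothetical. Reading the false-alarm rate α as the reciprocal of the mean time to false alarm is likewise an assumption, as is the specific form of the bound, which the abstract does not state.

```python
import math
from typing import Iterable, Optional


def cusum_alarm(increments: Iterable[float], threshold: float) -> Optional[int]:
    """Return the 0-based index of the first token at which the CUSUM
    statistic exceeds `threshold`, or None if no alarm is raised.

    Each increment plays the role of a per-token log-likelihood-ratio score:
    negative on average while the text is faithful, positive on average
    once a hallucination has started.
    """
    stat = 0.0
    for t, s in enumerate(increments):
        # Standard CUSUM recursion: accumulate evidence, reset at zero.
        stat = max(0.0, stat + s)
        if stat > threshold:
            return t
    return None


def lorden_delay(false_alarm_rate: float, divergence: float) -> float:
    """First-order approximation of Lorden's lower bound on detection delay,
    log(1 / alpha) / D, with D in nats per token (assumed form)."""
    return math.log(1.0 / false_alarm_rate) / divergence


# Toy stream (hypothetical scores): 5 faithful tokens, then onset at index 5.
onset = 5
toy_increments = [-0.5] * onset + [1.0] * 10
alarm = cusum_alarm(toy_increments, threshold=3.0)
delay = None if alarm is None else alarm - onset

# Under this approximation, a score that realizes only 1/4.5 of the
# available divergence stretches the bound by a factor of 4.5.
ratio = lorden_delay(0.01, 1.0 / 4.5) / lorden_delay(0.01, 1.0)
```

In this toy stream the statistic stays at zero before onset and then grows by one unit per token, so the alarm fires a few tokens after the change. The final ratio illustrates why a divergence deficit of 1/4.5 in the score translates directly into a proportionally longer delay bound, independent of the chosen false-alarm rate.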