Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

Abstract

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. At a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap between the achieved delay and the bound: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, and the remainder is a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
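
For readers unfamiliar with CUSUM, the following is a minimal sketch of how a one-sided CUSUM turns per-token increments into an alarm and a detection delay. It is not the paper's learned detector: the function name `cusum_alarm`, the toy increments, the threshold, and the convention of counting the onset token in the delay are all assumptions made for illustration. In the setting described above, the increments would come from a per-token score such as the one produced by the recurrent labeler.

```python
from typing import Iterable, Optional


def cusum_alarm(increments: Iterable[float], threshold: float) -> Optional[int]:
    """Return the 0-based index of the first token at which the CUSUM
    statistic reaches `threshold`, or None if it never does.

    Hypothetical helper for illustration; not taken from the paper.
    """
    stat = 0.0
    for t, inc in enumerate(increments):
        # Page's recursion: accumulate evidence, but clip at zero so that
        # a long faithful prefix cannot build up negative credit.
        stat = max(0.0, stat + inc)
        if stat >= threshold:
            return t
    return None


# Toy stream (assumed values): negative increments while the text is
# faithful, positive increments once the hallucination starts at index 5.
onset = 5
increments = [-0.5] * onset + [1.0] * 10

alarm = cusum_alarm(increments, threshold=3.0)

# Delay counted as the number of post-onset tokens consumed, including
# the onset token itself (an assumed convention).
delay = None if alarm is None else alarm - onset + 1
print(alarm, delay)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is why delays are only comparable at a matched false-alarm rate.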