Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

May 27, 2026
Author: Sachin Kumar
cs.AI

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics. They report AUROC above 0.96 on clean benchmarks, yet collapse under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about how deception is encoded: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (≥0.998) on clean data but collapse under stylistic shifts, whereas style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (a k=1 probe reaches only 0.61-0.80 AUROC), and cross-domain transfer failure is confirmed to be geometric rather than driven by layer mismatch; (c) the entropy-proxy hypothesis is rejected (max |ρ|=0.454, max Δ-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k≥5) recover the signal through distributed sub-threshold features. Probe fragility therefore reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the apparent inverse-scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
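To make the entropy-residualization test in finding (c) concrete, the following is a minimal sketch on synthetic data; it is not the paper's pipeline. It assumes that |ρ| is the correlation between probe scores and an entropy covariate, and that Δ-AUROC is the change in AUROC after the linear effect of entropy is regressed out of the scores. The difference-of-means probe, the synthetic activations, and all names (`auroc`, `residualize`, `entropy`) are illustrative assumptions.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U). Assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int((labels == 1).sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def residualize(scores, covariate):
    """Subtract the least-squares linear fit of `covariate` from `scores`."""
    X = np.column_stack([np.ones_like(covariate), covariate])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores - X @ beta

# Synthetic stand-ins (assumptions): activations with one label-dependent
# direction, and an "entropy" covariate weakly correlated with the label.
rng = np.random.default_rng(0)
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
acts = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * direction
entropy = rng.normal(size=n) + 0.3 * labels

# Difference-of-means probe fit on the first half, scored on the second.
train, test = slice(0, n // 2), slice(n // 2, n)
pos = acts[train][labels[train] == 1].mean(axis=0)
neg = acts[train][labels[train] == 0].mean(axis=0)
w = pos - neg
scores = acts[test] @ w

rho = np.corrcoef(scores, entropy[test])[0, 1]
auc_raw = auroc(scores, labels[test])
resid = residualize(scores, entropy[test])
auc_resid = auroc(resid, labels[test])
delta_auroc = auc_raw - auc_resid

print(f"|rho|={abs(rho):.3f}  AUROC raw={auc_raw:.3f}  "
      f"residualized={auc_resid:.3f}  delta={delta_auroc:.3f}")
```

Under this reading, a probe that merely tracks entropy would lose most of its AUROC after residualization, while a small Δ-AUROC indicates that the probe's signal is largely independent of entropy.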