Stress-Testing Deception Probes in Large Language Models: Scale, Robustness, and the Geometry of Deception Representations

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet probes that exceed 0.96 AUROC on clean benchmarks collapse under distribution shift. This paper systematically stress-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about how deception is encoded: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our experimental design includes cross-domain transfer matrices, multi-dimensional probe analysis against permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (≥0.998) on clean data but collapse under stylistic shifts, whereas style-augmented probes recover near-perfect detection on unseen styles (mean AUROC 0.979-0.983); (b) the single-direction hypothesis is rejected (a k=1 probe reaches only 0.61-0.80 AUROC), and the cross-domain transfer failure is confirmed to be geometric rather than driven by layer mismatch; (c) the entropy-proxy hypothesis is rejected (max |ρ| = 0.454; max Δ-AUROC after residualization = 0.004); and (d) deception does not form a significant linear subspace (k*=0 in every domain), yet multi-dimensional probes (k≥5) recover the signal through distributed sub-threshold features. Probe fragility therefore reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is an artifact of the training distribution rather than a genuine scale-dependent phenomenon.
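
To make the entropy-proxy check in (c) concrete, the following is a minimal sketch of one way such a test could be run. It rests on assumptions not stated in the abstract: that the test operates on scalar probe scores and a per-example entropy value, that ρ is a Pearson correlation, and that residualization is a simple least-squares fit. All function names and the synthetic data are hypothetical and are not the paper's implementation.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation (assumed here; the abstract does not say which rho is used)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def residualize(scores, covariate):
    """Subtract the least-squares linear fit of scores on the covariate."""
    mc, ms = sum(covariate) / len(covariate), sum(scores) / len(scores)
    cov = sum((c - mc) * (s - ms) for c, s in zip(covariate, scores))
    var = sum((c - mc) ** 2 for c in covariate)
    slope = cov / var
    return [s - ms - slope * (c - mc) for s, c in zip(scores, covariate)]

def entropy_proxy_check(scores, entropy, labels):
    """Return (rho, delta_auroc): correlation with entropy and AUROC lost by removing it."""
    rho = pearson(scores, entropy)
    delta = auroc(scores, labels) - auroc(residualize(scores, entropy), labels)
    return rho, delta

def synthetic_example(n=400, seed=0):
    """Hypothetical data: scores carry a label signal plus a label-independent entropy term."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    entropy = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scores = [3.0 * y + 0.5 * h + rng.gauss(0.0, 1.0) for y, h in zip(labels, entropy)]
    return scores, entropy, labels

if __name__ == "__main__":
    rho, delta = entropy_proxy_check(*synthetic_example())
    print(f"rho={rho:.3f}  delta_auroc={delta:.3f}")
```

Under this reading, a probe that merely tracked entropy would lose most of its AUROC once entropy is regressed out, whereas a small Δ-AUROC alongside a moderate |ρ| indicates that the probe's signal is not explained by entropy.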