Stress-Testing Deception Probes in Large Language Models: Scalability, Robustness, and the Geometry of Deceptive Representations

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics. Reported AUROC exceeds 0.96 on clean benchmarks, yet performance collapses under distributional shift. This paper systematically stress-tests probe-based metrics across the Gemma 3 model family (1B to 27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about how deception is encoded: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts, whereas probes trained with style augmentation recover near-perfect detection on unseen styles (mean AUROC 0.979-0.983); (b) the single-direction hypothesis is rejected (a k=1 probe reaches only 0.61-0.80 AUROC), and cross-domain transfer failure is confirmed to be geometric rather than driven by layer mismatch; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454; max Delta-AUROC after residualization=0.004); and (d) deception does not form a statistically significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility therefore reflects a narrow training distribution rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the apparent inverse-scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
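To make the entropy-residualization test in (c) concrete, the following is a minimal sketch under stated assumptions, not the procedure used in this work. It uses synthetic "activations" rather than Gemma 3 activations, a difference-of-means direction as a stand-in for the trained probe, and hypothetical helper names (`auroc`, `fit_mean_diff_probe`, `fit_residualizer`, `run_demo`). The idea it illustrates: regress the entropy covariate out of every activation dimension, refit the probe, and compare AUROC before and after. A small Delta-AUROC suggests the probe is not merely reading entropy.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def fit_mean_diff_probe(X, y):
    """Stand-in probe (an assumption): difference of the two class means."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)


def fit_residualizer(X, z):
    """Least-squares fit of every activation dimension on [1, z]."""
    Z = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return beta


def apply_residualizer(X, z, beta):
    """Subtract the part of X that is linearly predictable from z."""
    Z = np.column_stack([np.ones_like(z), z])
    return X - Z @ beta


def run_demo(seed=0, n=2000, d=32):
    """Return (AUROC before, AUROC after) residualization on synthetic data."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)              # 1 = "deceptive" (synthetic)
    z = rng.normal(size=n) + 0.5 * y            # entropy-like covariate
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                      # label direction
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)                      # entropy direction
    X = rng.normal(size=(n, d)) + 3.0 * np.outer(y, u) + np.outer(z, v)

    half = n // 2                               # first half train, second half test
    Xtr, ytr, ztr = X[:half], y[:half], z[:half]
    Xte, yte, zte = X[half:], y[half:], z[half:]

    w_raw = fit_mean_diff_probe(Xtr, ytr)
    auroc_raw = auroc(Xte @ w_raw, yte)

    beta = fit_residualizer(Xtr, ztr)           # fitted on training data only
    Rtr = apply_residualizer(Xtr, ztr, beta)
    Rte = apply_residualizer(Xte, zte, beta)
    w_res = fit_mean_diff_probe(Rtr, ytr)
    auroc_res = auroc(Rte @ w_res, yte)
    return auroc_raw, auroc_res


if __name__ == "__main__":
    before, after = run_demo()
    print(f"AUROC raw={before:.3f} residualized={after:.3f} "
          f"delta={before - after:+.3f}")
```

In this toy setup the label signal is planted independently of the entropy direction, so residualization is expected to change AUROC only slightly. On real activations, the same comparison would use the actual probe and a per-example entropy estimate in place of the synthetic `z`.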