Stress-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics. They report AUROC above 0.96 on clean benchmarks, yet their performance collapses under distributional shift. This paper systematically stress-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about how deception is encoded: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our experimental design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts, whereas style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (a k=1 probe reaches only AUROC 0.61-0.80), and the cross-domain transfer failure is confirmed to be geometric rather than caused by layer mismatch; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain optimal dimension k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility therefore reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, showing that the inverse scaling pattern is an artifact of the training distribution rather than a genuine scale-dependent phenomenon.
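To make the entropy-residualization test in (c) concrete, the following is a minimal sketch on synthetic data, not the pipeline used in this paper. Everything in it is an assumption made for illustration: the Gaussian "activations", the scalar "entropy" feature, the ridge-regression probe standing in for whatever probe the authors trained, and all function names. The idea is to regress the entropy feature out of the activations, retrain the probe, and check that held-out AUROC barely moves.

```python
import numpy as np

def ranks(a):
    """0-based ranks of a 1-D array (assumes continuous values, no ties)."""
    r = np.empty(len(a), dtype=float)
    r[np.argsort(a)] = np.arange(len(a))
    return r

def auroc(scores, labels):
    """Rank-based AUROC via the Mann-Whitney U statistic."""
    r = ranks(scores) + 1.0
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = r[labels == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def fit_probe(X, y, l2=1.0):
    """Ridge-regression linear probe on +/-1 targets; returns (weights, bias)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    t = 2.0 * y - 1.0
    gram = Xc.T @ Xc + l2 * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (t - t.mean()))
    return w, t.mean() - mu @ w

def residualize(X, h, X_fit, h_fit):
    """Subtract the part of X that is linearly predictable from h.

    The regression (with intercept) is fit on the training split only.
    """
    A_fit = np.column_stack([np.ones_like(h_fit), h_fit])
    coef, *_ = np.linalg.lstsq(A_fit, X_fit, rcond=None)
    return X - np.column_stack([np.ones_like(h), h]) @ coef

def entropy_residualization_demo(seed=0, n=2000, d=32):
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    # Synthetic activations: label signal along v plus isotropic noise.
    X = rng.normal(size=(n, d)) + 2.0 * y[:, None] * v
    # Synthetic entropy: only weakly related to the label (assumption).
    h = 0.5 * y + rng.normal(size=n)

    tr, te = slice(0, n // 2), slice(n // 2, n)
    w, b = fit_probe(X[tr], y[tr])
    base_scores = X[te] @ w + b
    base = auroc(base_scores, y[te])

    # Spearman correlation between probe score and entropy.
    rho = np.corrcoef(ranks(base_scores), ranks(h[te]))[0, 1]

    X_tr_res = residualize(X[tr], h[tr], X[tr], h[tr])
    X_te_res = residualize(X[te], h[te], X[tr], h[tr])
    w_r, b_r = fit_probe(X_tr_res, y[tr])
    resid = auroc(X_te_res @ w_r + b_r, y[te])
    return {"auroc": base, "auroc_resid": resid,
            "delta": base - resid, "rho": rho}

if __name__ == "__main__":
    print(entropy_residualization_demo())
```

In this toy setup, a small `delta` alongside a modest `rho` is the pattern that would argue against the probe being an entropy proxy. A probe that merely tracked entropy would instead lose most of its AUROC after residualization.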