Pressure-Testing Deception-Detection Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet although they report AUROC above 0.96 on clean benchmarks, they collapse under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they do. We test four hypotheses about how deception is encoded: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts, whereas style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (a k=1 probe reaches only 0.61-0.80 AUROC), and cross-domain transfer failure is confirmed to be geometric in origin rather than caused by layer mismatch; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454; max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility therefore reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, which establishes that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
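
To make the entropy-residualization test in (c) concrete, the following is a minimal sketch of one way such a check could be run. It is an illustration under stated assumptions, not the procedure used in this paper: it assumes the probe emits one scalar score per example, that a per-example entropy value is available, and that residualization means removing the least-squares linear fit of entropy from the scores. The function names and the synthetic data are hypothetical.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def residualize(scores, covariate):
    """Subtract the ordinary-least-squares linear fit of `covariate` from `scores`."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(covariate) / n
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(covariate, scores))
    var = sum((c - mean_c) ** 2 for c in covariate)
    slope = cov / var
    return [s - mean_s - slope * (c - mean_c) for s, c in zip(scores, covariate)]

def entropy_residualization_delta(scores, entropy, labels):
    """Return (AUROC before, AUROC after, drop) when entropy is regressed out."""
    before = auroc(scores, labels)
    after = auroc(residualize(scores, entropy), labels)
    return before, after, before - after

# Synthetic stand-in data (assumption): scores depend mostly on the label,
# with a small entropy component that is independent of the label.
random.seed(0)
labels = [i % 2 for i in range(400)]
entropy = [random.gauss(0.0, 1.0) for _ in labels]
scores = [3.0 * y + 0.3 * h + random.gauss(0.0, 0.5) for y, h in zip(labels, entropy)]

before, after, delta = entropy_residualization_delta(scores, entropy, labels)
print(f"AUROC before={before:.3f} after={after:.3f} delta={delta:.3f}")
```

Under this reading, a probe that merely tracked entropy would lose most of its AUROC after residualization, whereas a small drop indicates that the score carries label information beyond what entropy explains linearly.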