Pressure-Testing Deception Probes in Large Language Models: Scaling, Robustness, and the Geometry of Deceptive Representations

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet they report AUROC above 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about how deception is encoded: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts, whereas style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (a k=1 probe reaches only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
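
To make the entropy-residualization test in (c) concrete, the sketch below shows one plausible reading of it on synthetic data: correlate probe scores with an entropy covariate, regress the covariate out of the scores, and compare AUROC before and after. This is an illustration under stated assumptions, not the paper's implementation. The function names (`auroc`, `pearson`, `residualize`, `entropy_test`, `make_data`), the use of Pearson correlation, the choice to residualize scores rather than activations, and all synthetic parameters are hypothetical.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def _center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    cx, cy = _center(xs), _center(ys)
    num = sum(a * b for a, b in zip(cx, cy))
    den = (sum(a * a for a in cx) * sum(b * b for b in cy)) ** 0.5
    return num / den

def residualize(scores, covariate):
    """Remove the least-squares linear fit of `covariate` from `scores`."""
    cs, cc = _center(scores), _center(covariate)
    slope = sum(a * b for a, b in zip(cs, cc)) / sum(b * b for b in cc)
    return [a - slope * b for a, b in zip(cs, cc)]

def entropy_test(scores, entropy, labels):
    """Report score-entropy correlation and AUROC before/after residualization."""
    raw = auroc(scores, labels)
    resid = auroc(residualize(scores, entropy), labels)
    return {"rho": pearson(scores, entropy), "auroc_raw": raw,
            "auroc_resid": resid, "delta": abs(raw - resid)}

def make_data(proxy, n=400, seed=0):
    """Synthetic data (assumed, not from the paper); labels alternate 0/1."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    if proxy:
        # Scores carry label information only through entropy.
        entropy = [1.5 * y + rng.gauss(0, 1) for y in labels]
        scores = [h + rng.gauss(0, 0.5) for h in entropy]
    else:
        # Scores carry label information directly; entropy is a nuisance term.
        entropy = [rng.gauss(0, 1) for _ in labels]
        scores = [2.0 * y + 0.5 * h + rng.gauss(0, 0.5)
                  for y, h in zip(labels, entropy)]
    return scores, entropy, labels

for name, proxy in [("genuine signal", False), ("entropy proxy", True)]:
    result = entropy_test(*make_data(proxy))
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this toy setup, a probe that merely tracks entropy loses most of its discriminative power once entropy is regressed out, while a probe carrying label information of its own keeps a high AUROC. A small change in AUROC after residualization, as reported in (c), corresponds to the second case.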