RAPTOR: Ridge-Adaptive Logistic Probes

Abstract

Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering, that is, added to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose ridge strength is tuned on validation data and whose normalized weights serve as the concept vector. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime. This characterization explains how penalty strength mediates probe accuracy and concept-vector stability, and it yields structural predictions that qualitatively align with trends observed on real LLM embeddings.
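The probe-then-steer recipe described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic Gaussian features stand in for frozen LLM activations, the plain gradient-descent solver, the ridge grid, and the steering strength `alpha` are all assumptions, and the function names (`fit_ridge_logistic`, `fit_probe`, `steer`) are hypothetical.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, lr=0.5, steps=500):
    """Gradient descent on mean logistic loss + (lam / 2) * ||w||^2, labels y in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)      # clip logits for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))             # predicted probabilities
        w -= lr * (X.T @ (p - y) / n + lam * w)  # ridge penalty acts on weights only
        b -= lr * np.mean(p - y)                 # bias is left unpenalized (assumption)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

def fit_probe(X_tr, y_tr, X_val, y_val, lams=(1e-3, 1e-2, 1e-1, 1.0)):
    """Tune the ridge strength on validation accuracy; return a unit-norm concept vector."""
    best = None
    for lam in lams:
        w, b = fit_ridge_logistic(X_tr, y_tr, lam)
        acc = accuracy(X_val, y_val, w, b)
        if best is None or acc > best[0]:
            best = (acc, lam, w)
    acc, lam, w = best
    return w / np.linalg.norm(w), lam, acc

def steer(h, v, alpha):
    """Additive activation steering: shift a hidden state along the concept vector."""
    return h + alpha * v

# Synthetic stand-in for layer activations: labels come from a hidden "teacher" direction.
rng = np.random.default_rng(0)
d = 16
teacher = rng.normal(size=d)
teacher /= np.linalg.norm(teacher)
X = rng.normal(size=(400, d))
y = (X @ teacher > 0).astype(float)

concept_vec, best_lam, val_acc = fit_probe(X[:300], y[:300], X[300:], y[300:])

h = rng.normal(size=d)                       # one hypothetical hidden state
h_steered = steer(h, concept_vec, alpha=4.0)
```

Because the concept vector has unit norm, `alpha` directly sets how far the hidden state moves along the probed direction. In a real pipeline the shift would be applied to a chosen layer's output during the forward pass, and the layer and strength would need to be selected empirically.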
