RAPTOR: Ridge-Adaptive Logistic Probes

January 29, 2026
Authors: Ziqi Gao, Yaotian Zhu, Qingcheng Zeng, Xu Zhao, Ziqing Wang, Feng Ruan, Kaize Ding
cs.AI

Abstract

Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.
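
The probe-then-steer pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes synthetic Gaussian features standing in for frozen LLM activations, plain gradient descent as the solver, a small hypothetical grid of ridge strengths, and validation accuracy as the selection criterion. All function names (`fit_ridge_logistic`, `fit_probe`, `steer`) are made up for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_ridge_logistic(X, y, lam, steps=200, lr=0.3):
    """Gradient descent on mean log-loss + (lam / 2) * ||w||^2 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            err = (sigmoid(dot(w, xi) + b) - yi) / n
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(w, b, X, y):
    hits = sum((dot(w, xi) + b > 0) == (yi == 1) for xi, yi in zip(X, y))
    return hits / len(X)


def fit_probe(X_tr, y_tr, X_val, y_val, lams=(1e-3, 1e-2, 1e-1, 1.0)):
    """Assumed selection rule: keep the ridge strength with best validation accuracy."""
    best = None
    for lam in lams:
        w, b = fit_ridge_logistic(X_tr, y_tr, lam)
        acc = accuracy(w, b, X_val, y_val)
        if best is None or acc > best[0]:
            best = (acc, lam, w)
    acc, lam, w = best
    norm = math.sqrt(dot(w, w))
    concept = [wj / norm for wj in w]  # concept vector = normalized probe weights
    return concept, lam, acc


def steer(h, concept, alpha):
    # Additive activation steering: shift a representation along the concept vector.
    return [hj + alpha * cj for hj, cj in zip(h, concept)]


def make_data(n, direction, rng, shift=1.5):
    # Synthetic stand-in for layer activations: Gaussian noise plus a
    # label-dependent shift along a known "concept" direction.
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(0.0, 1.0) + sign * shift * u for u in direction])
        y.append(label)
    return X, y


rng = random.Random(0)
d = 8
true_dir = [1.0 / math.sqrt(d)] * d
X_tr, y_tr = make_data(120, true_dir, rng)
X_val, y_val = make_data(60, true_dir, rng)

concept, lam, val_acc = fit_probe(X_tr, y_tr, X_val, y_val)
h = X_val[0]
h_steered = steer(h, concept, alpha=4.0)
print("selected lam:", lam, "validation accuracy:", val_acc)
print("cosine with true direction:", dot(concept, true_dir))
```

Because the concept vector has unit norm, steering with strength `alpha` moves the representation's projection onto that vector by exactly `alpha`, which keeps the steering scale comparable across ridge strengths. In a real setting, `X_tr` and `X_val` would be hidden states extracted from a chosen layer of a frozen model, and the steered vector would be written back during the forward pass.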