RAPTOR: Ridge-Adaptive Logistic Probes

January 29, 2026
Authors: Ziqi Gao, Yaotian Zhu, Qingcheng Zeng, Xu Zhao, Ziqing Wang, Feng Ruan, Kaize Ding
cs.AI

Abstract

Probing studies what information is encoded in a frozen LLM's layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.
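
The probe-then-steer recipe described in the abstract can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' implementation: the helper names (`fit_ridge_logistic`, `select_probe`, `steer`, `make_toy_data`), the ridge grid, the use of plain gradient descent as the solver, and selection by validation accuracy are all assumptions made here for illustration. Real usage would replace the toy Gaussian features with frozen LLM layer activations.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function via the tanh identity.
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def fit_ridge_logistic(X, y, lam, steps=500):
    """Minimise mean logistic loss + (lam / 2) * ||w||^2 by gradient descent.

    y holds labels in {0, 1}; the bias b is left unregularised.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Step size 1 / L, where L upper-bounds the curvature of the objective.
    lr = 1.0 / (0.25 * (np.linalg.norm(X, 2) ** 2 / n + 1.0) + lam)
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err / n + lam * w)
        b -= lr * err.mean()
    return w, b


def select_probe(X_tr, y_tr, X_val, y_val, lambdas=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    """Tune the ridge strength on held-out data; return the unit-norm concept vector."""
    best = None
    for lam in lambdas:  # hypothetical grid, not taken from the paper
        w, b = fit_ridge_logistic(X_tr, y_tr, lam)
        acc = np.mean((X_val @ w + b > 0) == (y_val == 1))
        if best is None or acc > best[0]:
            best = (acc, lam, w)
    acc, lam, w = best
    return w / np.linalg.norm(w), lam, acc


def steer(h, v, alpha):
    """Additive activation steering: shift a representation along the concept vector."""
    return h + alpha * v


def make_toy_data(rng, n, u, shift=1.5):
    """Gaussian features whose class means differ along the unit direction u."""
    y = rng.integers(0, 2, size=n).astype(float)
    X = rng.normal(size=(n, u.size)) + shift * (2.0 * y - 1.0)[:, None] * u
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.normal(size=32)
    u /= np.linalg.norm(u)
    X_tr, y_tr = make_toy_data(rng, 200, u)
    X_val, y_val = make_toy_data(rng, 100, u)
    v, lam, acc = select_probe(X_tr, y_tr, X_val, y_val)
    print("chosen lambda:", lam)
    print("validation accuracy:", acc)
    print("cosine with planted direction:", float(v @ u))
```

Normalising the weights means the steering coefficient `alpha` alone sets the size of the intervention, which keeps concept vectors comparable across different ridge strengths.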