Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
May 25, 2026
Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech
cs.AI
Abstract
Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of these oracles remains understudied. Here, we investigate six methods for estimating the confidence of activation oracles and evaluate how well calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) show that bootstrap mode frequency is the best-calibrated method among those tested (expected calibration error, ECE, of 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-probability baseline can serve as a fast triage signal at a fraction of the cost.
Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
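For readers unfamiliar with the calibration metric quoted in the abstract, the following is a minimal sketch of how ECE is commonly computed with equal-width confidence bins, together with a simple mode-frequency confidence score. It is not taken from the authors' repository. The function names, the choice of 10 bins, and the reading of "mode frequency" as the share of replicate answers that agree with the most common answer are all assumptions made for illustration; the abstract does not specify the bootstrap procedure that produces the replicates, so that step is left out.

```python
from collections import Counter


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted mean of
    |accuracy - mean confidence| over the non-empty bins.

    `n_bins=10` is an assumed default, not a value from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def mode_frequency(answers):
    """Return the most common answer and the fraction of replicates that
    agree with it (hypothetical reading of 'mode frequency'; how the
    replicates are generated is not reproduced here)."""
    if not answers:
        raise ValueError("at least one answer is required")
    mode, count = Counter(answers).most_common(1)[0]
    return mode, count / len(answers)
```

Under these assumptions, the second element returned by `mode_frequency` would serve as the confidence for one oracle query, and `expected_calibration_error` would then be evaluated over all queries with known ground truth.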