Confidence and Calibration of Activation Oracles: Toward Reliable Interpretation of Language Model Internals

Abstract

Activation oracles aim to make the activations of other models legible to humans and have shown promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such oracles remains understudied. Here, we investigate six methods for estimating the confidence of activation oracles and evaluate how well calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) show that bootstrap mode frequency is the best-calibrated method among those tested (expected calibration error (ECE) of 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-probability baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
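
To make the two quantities named above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a mode-frequency confidence is the share of resampled oracle answers that agree with the most common answer, and that ECE is computed with equal-width confidence bins. The function names, the answer normalization, and the bin count are hypothetical choices made for illustration.

```python
from collections import Counter


def mode_frequency(answers):
    """Return the most common answer and the fraction of samples agreeing with it.

    Assumption: answers are compared after stripping whitespace and lowercasing.
    """
    counts = Counter(a.strip().lower() for a in answers)
    mode, count = counts.most_common(1)[0]
    return mode, count / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of bin weight * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

In this sketch, each oracle query would be answered several times (for example under resampled verbalizer or context prompts), mode_frequency would turn those answers into one confidence score, and expected_calibration_error would then compare the scores against correctness labels across all queries.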