Confidence and Calibration of Activation Oracles: Toward Reliable Interpretation of Language Model Internals

Abstract

Activation oracles aim to make the activations of other models legible to humans and have shown promising results relative to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such oracles remains understudied. We investigate six methods for estimating the confidence of activation oracles and evaluate how well calibrated the resulting confidence scores are. Experiments on 6,000 samples per oracle, varying the verbalizer and context prompts, show that bootstrap mode frequency is the best-calibrated method among those tested: its expected calibration error (ECE) is 5.7% on Qwen3-8B, against 25.5% for the answer-word log-probability, and 10.3% against 13.1% on Qwen3.6-27B. The log-probability baseline can nevertheless serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
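
To make the two quantities named above concrete, the following is a minimal sketch and not the implementation from the repository. It assumes that "mode frequency" means the share of resampled oracle answers that agree with the most common answer, and that ECE is computed with equal-width confidence bins. The function names, the answer normalization, and the default of 10 bins are illustrative assumptions.

```python
from collections import Counter


def mode_frequency(answers):
    """Return (most common answer, share of samples agreeing with it).

    Assumption: answers are compared after stripping whitespace and
    lowercasing; the paper may normalize answers differently.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    mode, count = Counter(normalized).most_common(1)[0]
    return mode, count / len(normalized)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, each query would be answered several times, `mode_frequency` would turn the sampled answers into a single answer with a confidence score, and `expected_calibration_error` would then be evaluated over all queries using those scores and the correctness of each modal answer.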