Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

May 25, 2026
Authors: Federico Torrielli, Peter Schneider-Kamp, Lukas Galke Poech
cs.AI

Abstract

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such activation oracles remains understudied. Here, we investigate six methods for estimating the confidence of activation oracles and evaluate how well calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) reveal that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
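The abstract does not define its two central quantities, so here is a minimal sketch of how they are commonly computed. It is not taken from the authors' repository. It assumes that "bootstrap mode frequency" means the share of resampled oracle answers that agree with the most frequent answer, and that ECE is the standard equal-width binned expected calibration error. The function names `mode_frequency` and `expected_calibration_error` and the choice of 10 bins are hypothetical, chosen for illustration only.

```python
from collections import Counter


def mode_frequency(answers):
    """Return the most frequent answer and the fraction of answers equal to it.

    Assumption: `answers` holds the oracle's answers to repeated (resampled)
    queries about the same activation; the returned fraction is used as the
    confidence score.
    """
    counts = Counter(answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    Bins are equal-width over [0, 1]; a confidence of exactly 1.0 is placed
    in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data, not results from the paper.
    answer, confidence = mode_frequency(["cat", "cat", "dog", "cat"])
    print(answer, confidence)
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Under these assumptions, a method is well calibrated when answers given with confidence near 0.9 are correct about 90% of the time, which is what a low ECE summarizes in a single number.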