Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

Abstract

Activation oracles aim to make the activations of other models legible to humans, and they show promising results compared with white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such oracles remains understudied. We investigate six methods for estimating the confidence of activation oracles and evaluate how well calibrated their confidence scores are. Experiments on 6,000 samples per oracle (varying verbalizer and context prompts) show that bootstrap mode frequency is the best-calibrated method among those tested (ECE 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-probability baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
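
To make the two quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a mode-frequency confidence is the fraction of resampled oracle answers that agree with the most common answer, and that ECE is computed with equal-width confidence bins. The function names, the bin count, and the way answers are resampled are all hypothetical; the released repository may define them differently.

```python
from collections import Counter
from typing import Sequence


def mode_frequency(answers: Sequence[str]) -> tuple[str, float]:
    """Return the most common answer and the share of samples agreeing with it.

    Assumption: `answers` holds repeated oracle outputs for one input,
    already normalised to comparable strings.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins (the binning scheme is an assumption).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of samples that fall into it.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, a confidence score would be produced per sample with `mode_frequency`, and the resulting scores would be passed together with correctness labels to `expected_calibration_error`; a lower value indicates that stated confidence tracks observed accuracy more closely.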