Confidence and Calibration of Activation Oracles for Trustworthy Interpretation of Language Model Internals

Abstract

Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. However, uncertainty quantification (UQ) for the natural-language outputs of such oracles remains understudied. Here, we investigate six methods for estimating the confidence of activation oracles and evaluate how well calibrated their confidence scores are. Our experiments on 6,000 samples per oracle (varying verbalizer and context prompts) show that bootstrap mode frequency is the best-calibrated method among those tested (expected calibration error, ECE, of 5.7% vs. 25.5% for the answer-word log-probability on Qwen3-8B; 10.3% vs. 13.1% on Qwen3.6-27B), and that the log-prob baseline can serve as a fast triage signal at a fraction of the cost. Code and the patched trainer are available at https://github.com/federicotorrielli/probabilistic_activation_oracles.
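
To make the two quantities named above concrete, the following is a minimal sketch, not the implementation in the linked repository. It assumes that mode frequency is the share of repeatedly sampled oracle answers that agree with the most common answer, and that ECE is computed with equal-width confidence bins. The function names, the lowercase/strip normalization, and the default of 10 bins are illustrative assumptions; the abstract does not specify the resampling procedure or the binning scheme.

```python
from collections import Counter


def mode_frequency(answers):
    """Return (most common answer, its share of the samples).

    Assumption: `answers` holds repeated samples of the oracle's answer for one
    input; the share of the mode is used as the confidence score.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Hypothetical normalization so trivially different strings count as equal.
    normalized = [a.strip().lower() for a in answers]
    mode, count = Counter(normalized).most_common(1)[0]
    return mode, count / len(normalized)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sample-weighted mean of |accuracy - mean confidence|.

    Assumption: equal-width bins over [0, 1]; the bin count is illustrative.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Under these assumptions, each evaluated sample contributes one confidence score (its mode frequency) and one correctness label for the modal answer, and the ECE over all samples summarizes how closely stated confidence tracks observed accuracy.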