Detecting and Mitigating Hallucinations in Whisper via Hidden-Representation Steering and Sparse Autoencoders

Abstract

Whisper, a widely adopted ASR model, is known to suffer from hallucinations: coherent transcriptions generated for non-speech audio that are entirely disconnected from the input. We investigate whether these hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse subset of features and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. On the full non-speech test set, SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3, with only a small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
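
To make the idea of activation-space steering concrete, the following is a minimal sketch on synthetic vectors, not the method evaluated here. It assumes a difference-of-means direction between "hallucinating" and "clean" activations, a midpoint threshold as a stand-in linear probe, and a fixed steering strength `alpha`; all function names, the dimensionality, and the data are hypothetical, and no Whisper model or SAE is involved.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steering_direction(halluc_acts, clean_acts):
    """Unit vector from the clean mean toward the hallucination mean.
    Assumption: a difference-of-means direction, chosen for illustration."""
    mu_h = mean_vector(halluc_acts)
    mu_c = mean_vector(clean_acts)
    diff = [h - c for h, c in zip(mu_h, mu_c)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def steer(activation, direction, alpha):
    """Shift an activation by alpha against the given direction."""
    return [a - alpha * d for a, d in zip(activation, direction)]

# Synthetic stand-ins for encoder activations (hypothetical data):
# the two groups differ only by an offset in the first coordinate.
rng = random.Random(0)
DIM = 8

def sample(offset):
    v = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    v[0] += offset
    return v

halluc = [sample(2.0) for _ in range(50)]
clean = [sample(0.0) for _ in range(50)]

direction = steering_direction(halluc, clean)

# Stand-in linear probe: threshold halfway between the projected group means.
threshold = 0.5 * (dot(mean_vector(halluc), direction)
                   + dot(mean_vector(clean), direction))

alpha = 2.0  # hypothetical steering strength
steered = [steer(v, direction, alpha) for v in halluc]

flagged_before = sum(dot(v, direction) > threshold for v in halluc)
flagged_after = sum(dot(v, direction) > threshold for v in steered)
print(f"flagged before steering: {flagged_before}/{len(halluc)}")
print(f"flagged after steering:  {flagged_after}/{len(steered)}")
```

Because the direction has unit norm, steering lowers each projection onto it by exactly `alpha`. In a real setting the direction, the layer at which it is applied, and the strength would all need to be chosen against both hallucination rate and WER on speech data.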