Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders
June 5, 2026
Authors: Georgii Aparin, Vadim Popov, Tasnima Sadekova, Assel Yermekova
cs.AI
Abstract
Whisper, a widely adopted automatic speech recognition (ASR) model, is known to suffer from hallucinations: coherent transcriptions, generated for non-speech audio, that are entirely disconnected from the input. We investigate whether these hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse subset of features and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. On the full non-speech test set, SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3, with only a small degradation in word error rate (WER) on speech data, approaching the performance of fine-tuning-based methods.
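
The abstract does not spell out how either steering strategy is computed, so the following is only a minimal NumPy sketch of one common recipe from the activation-steering literature, not the authors' implementation. It assumes that a steering direction is the difference between class means of encoder activations, that activation-space steering removes the component along that direction, and that the SAE is a plain ReLU autoencoder whose selected latents are rescaled before decoding. All function names, parameter names (`strength`, `scale`, `feature_ids`), shapes, and the synthetic data are hypothetical.

```python
import numpy as np


def mean_difference_direction(halluc_acts, clean_acts):
    """Unit vector from the clean-class mean to the hallucination-class mean.

    Assumption: a difference-of-means direction; the paper may use another estimator.
    """
    direction = halluc_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer_activations(acts, direction, strength=1.0):
    """Subtract `strength` times each row's component along the unit `direction`."""
    coeffs = acts @ direction  # projection coefficients, shape (n,)
    return acts - strength * np.outer(coeffs, direction)


def steer_sae_latents(acts, W_enc, b_enc, W_dec, b_dec, feature_ids, scale=0.0):
    """Rescale selected SAE latents and apply only the resulting change to `acts`.

    Assumption: a ReLU SAE with encoder (W_enc, b_enc) and decoder (W_dec, b_dec).
    Adding the decoded difference keeps the SAE reconstruction error untouched.
    """
    latents = np.maximum(acts @ W_enc + b_enc, 0.0)
    recon = latents @ W_dec + b_dec
    edited = latents.copy()
    edited[:, feature_ids] *= scale
    return acts + (edited @ W_dec + b_dec - recon)


# Synthetic stand-ins for encoder activations (hypothetical hidden size of 16).
rng = np.random.default_rng(0)
clean_acts = rng.normal(size=(200, 16))
halluc_acts = rng.normal(size=(200, 16)) + 3.0 * np.eye(16)[0]

direction = mean_difference_direction(halluc_acts, clean_acts)
steered = steer_activations(halluc_acts, direction, strength=1.0)

# Randomly initialised SAE weights, used only to exercise the function.
W_enc = rng.normal(size=(16, 64))
b_enc = np.zeros(64)
W_dec = rng.normal(size=(64, 16))
b_dec = np.zeros(16)
sae_steered = steer_sae_latents(
    halluc_acts, W_enc, b_enc, W_dec, b_dec, feature_ids=[0, 3], scale=0.0
)
```

In a real setup the activations would come from a chosen Whisper encoder layer and the SAE weights from a trained autoencoder; which layer, which latents, and how strongly to steer are choices the abstract leaves open.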