Hallucination Detection and Mitigation in Whisper via Latent Representation Steering and Sparse Autoencoders

Abstract

Whisper, a widely adopted automatic speech recognition (ASR) model, is known to hallucinate: given non-speech audio, it can produce coherent transcriptions that are entirely unrelated to the input. We investigate whether these hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and sparse autoencoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse subset of features and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. On the full non-speech test set, SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3, with only a small degradation in word error rate (WER) on speech data, approaching the performance of fine-tuning-based methods.
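
To make the two steering strategies concrete, the following is a minimal sketch on synthetic data, not the procedure used in this work. It assumes that activation-space steering subtracts a difference-of-means vector and that SAE latent-space steering zeroes selected latents of a ReLU SAE and adds back the decoded change. All function names, dimensions, weights, and latent indices are hypothetical; real activations would come from Whisper's audio encoder and the SAE weights from training.

```python
import numpy as np


def mean_difference_vector(halluc_acts, speech_acts):
    """Assumed steering vector: hallucination mean minus speech mean."""
    return halluc_acts.mean(axis=0) - speech_acts.mean(axis=0)


def steer_activations(acts, vector, alpha=1.0):
    """Activation-space steering: shift every activation by -alpha * vector."""
    return acts - alpha * vector


def steer_sae_latents(acts, w_enc, b_enc, w_dec, latent_ids):
    """SAE latent-space steering (sketch): zero chosen latents, decode the change.

    Only the decoded difference is added, so whatever the SAE fails to
    reconstruct in `acts` is left untouched.
    """
    z = np.maximum(acts @ w_enc + b_enc, 0.0)  # ReLU SAE encoder (assumed form)
    z_edit = z.copy()
    z_edit[:, latent_ids] = 0.0                # suppress the selected latents
    return acts + (z_edit - z) @ w_dec


# Synthetic stand-ins for encoder activations (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64
speech = rng.normal(size=(200, d_model))
shift = np.zeros(d_model)
shift[3] = 4.0                                 # toy "hallucination" offset
halluc = rng.normal(size=(200, d_model)) + shift

# Activation-space steering with the difference-of-means vector.
vector = mean_difference_vector(halluc, speech)
steered = steer_activations(halluc, vector, alpha=1.0)

# SAE latent-space steering with random (untrained) SAE weights.
w_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
w_dec = rng.normal(size=(d_sae, d_model))
sae_steered = steer_sae_latents(halluc, w_enc, b_enc, w_dec, latent_ids=[0, 5])
```

With `alpha=1.0`, the mean of the steered hallucination activations coincides with the speech mean in this toy setup. In practice, the steering strength and the choice of latents would presumably be tuned against both the hallucination rate and WER on speech.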