Whisper Hallucination Detection and Mitigation Using Hidden Representation Steering and Sparse Autoencoders

Abstract

Whisper, a widely adopted ASR model, is known to hallucinate: for non-speech audio, it generates coherent transcriptions that are entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and sparse autoencoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse subset of features and increasing toward deeper encoder layers. We then propose two steering strategies: activation-space steering and SAE latent-space steering. On the full non-speech test set, SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3, with only a small WER degradation on speech data, approaching the performance of fine-tuning-based methods.
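
As a rough illustration of what activation-space steering can look like, the sketch below builds a steering direction from the difference of class means and removes that component from the activations. This is a minimal sketch under stated assumptions, not the method used in this work: the difference-of-means direction, the projection-removal rule, the function names (`steering_direction`, `hallucination_score`, `steer`), the strength parameter `alpha`, and the synthetic data are all hypothetical stand-ins for pooled Whisper encoder activations.

```python
import numpy as np


def steering_direction(halluc_acts, speech_acts):
    """Unit-norm difference of class means (an assumed choice of direction)."""
    diff = halluc_acts.mean(axis=0) - speech_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def hallucination_score(acts, direction):
    """Linear score: projection of each activation vector onto the direction."""
    return acts @ direction


def steer(acts, direction, alpha=1.0):
    """Subtract alpha times the component of each row along the direction."""
    return acts - alpha * np.outer(acts @ direction, direction)


# Synthetic stand-ins for pooled encoder activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
speech_acts = rng.normal(size=(200, dim))
halluc_acts = rng.normal(size=(200, dim))
halluc_acts[:, 0] += 4.0  # shift one feature to mimic a separable signal

direction = steering_direction(halluc_acts, speech_acts)
steered = steer(halluc_acts, direction, alpha=1.0)
```

With `alpha=1.0` the component along the direction is removed entirely, so the linear score of the steered activations is zero up to floating-point error. Smaller values of `alpha` would steer more gently, which is one plausible way to trade hallucination reduction against WER on real speech.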