Detecting and Mitigating Whisper Hallucinations via Hidden Representation Steering and Sparse Autoencoders

Abstract

Whisper, a widely adopted ASR model, is known to hallucinate: given non-speech audio, it can produce coherent transcriptions that are entirely disconnected from the input. We investigate whether these hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoEncoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse subset of features and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. On the full non-speech test set, SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3, with only minor WER degradation on speech data, approaching the performance of fine-tuning-based methods.
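
To make the two steering strategies more concrete, the following is a minimal NumPy sketch, not the exact procedure used in this work. It assumes that activation-space steering removes each activation's component along a difference-of-means direction between hallucinating and non-hallucinating examples, and that SAE latent-space steering rescales a chosen set of latents in a ReLU sparse autoencoder while preserving the reconstruction residual. The function names, the parameters `alpha`, `scale`, and `latent_ids`, and the SAE weight layout are all hypothetical choices made for illustration.

```python
import numpy as np


def mean_difference_direction(halluc_acts, clean_acts):
    """Unit vector pointing from the clean mean to the hallucination mean.

    Assumption: a difference-of-means direction stands in for whatever
    steering direction the method actually uses.
    """
    d = halluc_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def steer_activations(acts, direction, alpha=1.0):
    """Subtract alpha times each activation's projection onto `direction`.

    `direction` is expected to be a unit vector; alpha=1.0 removes the
    component along it entirely.
    """
    proj = acts @ direction  # shape (n,)
    return acts - alpha * proj[:, None] * direction[None, :]


def steer_sae_latents(acts, W_enc, b_enc, W_dec, b_dec, latent_ids, scale=0.0):
    """Rescale selected SAE latents and map the change back to activations.

    Assumption: a plain ReLU SAE with encoder (W_enc, b_enc) and decoder
    (W_dec, b_dec). The SAE reconstruction error is kept, so scale=1.0
    leaves the activations unchanged.
    """
    z = np.maximum(acts @ W_enc + b_enc, 0.0)
    recon = z @ W_dec + b_dec
    z_steered = z.copy()
    z_steered[:, latent_ids] *= scale
    return acts + (z_steered @ W_dec + b_dec) - recon
```

In this sketch, `latent_ids` would be the sparse subset of latents that a linear probe identifies as most discriminative for hallucination; how that subset and the steering strength are selected is left open here.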