Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
July 6, 2023
Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
cs.AI
Abstract
In this paper, we focus on Whisper, a recent automatic speech recognition
model trained with a massive 680k-hour labeled speech corpus recorded in
diverse conditions. We first show an interesting finding that while Whisper is
very robust against real-world background sounds (e.g., music), its audio
representation is actually not noise-invariant, but is instead highly
correlated with non-speech sounds, indicating that Whisper recognizes speech
conditioned on the noise type. With this finding, we build a unified audio
tagging and speech recognition model Whisper-AT by freezing the backbone of
Whisper, and training a lightweight audio tagging model on top of it. With <1%
extra computational cost, Whisper-AT can recognize audio events, in addition to
spoken text, in a single forward pass.
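
The frozen-backbone idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the "backbone" here is a small fixed projection standing in for the Whisper encoder, the tagging head is a plain mean-pooled logistic layer chosen for brevity, and all names and sizes (FROZEN_W, FEAT_DIM, NUM_EVENTS, TaggingHead) are hypothetical. The point it shows is only that the backbone weights are never updated while a small multi-label head is trained on its features.

```python
import math

FEAT_DIM = 8      # hypothetical feature size of the frozen encoder
NUM_EVENTS = 3    # hypothetical number of audio event classes

# Placeholder for the frozen backbone: a fixed projection that is never updated.
# A real system would use the Whisper encoder here (assumption, not shown).
FROZEN_W = [[1.0 if d % 4 == j else -0.5 for j in range(4)] for d in range(FEAT_DIM)]


def frozen_backbone(frames):
    """Map each 4-dim input frame to a FEAT_DIM feature vector (weights stay fixed)."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in FROZEN_W]
            for frame in frames]


def mean_pool(features):
    """Average frame-level features over time into one clip-level vector."""
    n = len(features)
    return [sum(f[d] for f in features) / n for d in range(FEAT_DIM)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TaggingHead:
    """Lightweight multi-label head: one logistic unit per audio event."""

    def __init__(self):
        self.w = [[0.0] * FEAT_DIM for _ in range(NUM_EVENTS)]
        self.b = [0.0] * NUM_EVENTS

    def predict(self, pooled):
        return [sigmoid(sum(wi * xi for wi, xi in zip(w, pooled)) + b)
                for w, b in zip(self.w, self.b)]

    def train_step(self, pooled, labels, lr=0.5):
        """One gradient step on binary cross-entropy; only head weights change."""
        probs = self.predict(pooled)
        for k in range(NUM_EVENTS):
            err = probs[k] - labels[k]  # gradient of BCE w.r.t. the logit
            for d in range(FEAT_DIM):
                self.w[k][d] -= lr * err * pooled[d]
            self.b[k] -= lr * err


# Two toy "clips" with different content and different event labels.
clip_a = [[1.0, 0.0, 0.0, 0.0]] * 5
clip_b = [[0.0, 1.0, 0.0, 0.0]] * 5
data = [(mean_pool(frozen_backbone(clip_a)), [1, 0, 0]),
        (mean_pool(frozen_backbone(clip_b)), [0, 1, 0])]

head = TaggingHead()
for _ in range(200):
    for pooled, labels in data:
        head.train_step(pooled, labels)

for pooled, labels in data:
    print(labels, [round(p, 3) for p in head.predict(pooled)])
```

Because the backbone output is computed once and shared, a head like this adds only a small amount of work on top of the encoder pass, which is the intuition behind running tagging and recognition together in one forward pass.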