Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680,000-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: while Whisper is very robust to real-world background sounds (e.g., music), its audio representation is actually not noise-invariant but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. Building on this finding, we construct Whisper-AT, a unified audio tagging and speech recognition model, by freezing the Whisper backbone and training a lightweight audio tagging model on top of it. With less than 1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
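The frozen-backbone idea described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation and does not use Whisper or its API: `frozen_encoder` is a hypothetical stand-in (fixed random weights) for a pretrained encoder, `TaggingHead` is an assumed simple logistic head rather than the paper's actual tagging model, and the two toy "clips" are invented for illustration. It only shows the training pattern: the encoder weights are never updated, and only the small head on top of the pooled representation is trained.

```python
import math
import random

random.seed(0)
FEAT_DIM, HIDDEN_DIM, NUM_EVENTS = 4, 8, 2

# Hypothetical stand-in for a frozen pretrained encoder: fixed random weights.
FROZEN_W = [[random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
            for _ in range(HIDDEN_DIM)]

def frozen_encoder(frames):
    """Map each input frame to a hidden vector; the weights are never updated."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame)))
             for row in FROZEN_W] for frame in frames]

def mean_pool(hidden):
    """Average hidden vectors over time to get one clip-level vector."""
    return [sum(col) / len(hidden) for col in zip(*hidden)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TaggingHead:
    """Assumed lightweight multi-label head: one logistic unit per audio event."""

    def __init__(self):
        self.w = [[0.0] * HIDDEN_DIM for _ in range(NUM_EVENTS)]
        self.b = [0.0] * NUM_EVENTS

    def predict(self, pooled):
        return [sigmoid(sum(wi * p for wi, p in zip(w, pooled)) + b)
                for w, b in zip(self.w, self.b)]

    def train_step(self, pooled, labels, lr=0.5):
        # Gradient of binary cross-entropy w.r.t. each logit is (prob - label).
        probs = self.predict(pooled)
        for k, (p, y) in enumerate(zip(probs, labels)):
            for j, x in enumerate(pooled):
                self.w[k][j] -= lr * (p - y) * x
            self.b[k] -= lr * (p - y)

# Two invented toy "clips" with different input patterns and multi-hot labels.
clip_a = [[1.0, 0.0, 0.0, 0.0]] * 5   # labeled with event 0
clip_b = [[0.0, 1.0, 0.0, 0.0]] * 5   # labeled with event 1
data = [(clip_a, [1.0, 0.0]), (clip_b, [0.0, 1.0])]

weights_before = [row[:] for row in FROZEN_W]
head = TaggingHead()
for _ in range(200):
    for frames, labels in data:
        hidden = frozen_encoder(frames)             # one encoder pass per clip
        head.train_step(mean_pool(hidden), labels)  # only the head is updated

probs_a = head.predict(mean_pool(frozen_encoder(clip_a)))
probs_b = head.predict(mean_pool(frozen_encoder(clip_b)))
```

In a unified model of the kind the abstract describes, the same encoder output would also feed the speech recognition decoder, so tagging would add only the cost of the head. The decoder is omitted here.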
