Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

July 6, 2023
Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
cs.AI

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. Building on this finding, we build a unified audio tagging and speech recognition model, Whisper-AT, by freezing the backbone of Whisper and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
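
The "freeze the backbone, train a lightweight head on top" recipe can be illustrated with a small toy sketch. This is not the Whisper-AT implementation: the backbone below is a fixed random projection standing in for a pretrained encoder, the head is a plain multi-label logistic classifier, and every name (`make_frozen_backbone`, `TaggingHead`, `train_head`, `bce_loss`) is hypothetical. The abstract does not specify the actual head architecture or training setup, so treat this only as a picture of the general idea.

```python
import math
import random


def make_frozen_backbone(in_dim, feat_dim, seed=0):
    """Toy stand-in for a pretrained encoder: fixed random projection + tanh.

    The weights are stored in tuples, so training code cannot update them.
    (Assumption for illustration; a real backbone would be a loaded model.)
    """
    rng = random.Random(seed)
    weights = tuple(
        tuple(rng.uniform(-1.0, 1.0) for _ in range(in_dim))
        for _ in range(feat_dim)
    )

    def backbone(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return backbone, weights


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TaggingHead:
    """Lightweight multi-label head: one independent sigmoid per event tag."""

    def __init__(self, feat_dim, n_tags):
        self.w = [[0.0] * feat_dim for _ in range(n_tags)]
        self.b = [0.0] * n_tags

    def predict(self, feats):
        return [
            sigmoid(sum(wj * fj for wj, fj in zip(row, feats)) + bias)
            for row, bias in zip(self.w, self.b)
        ]


def bce_loss(head, feats_batch, targets_batch):
    """Binary cross-entropy, summed over tags and averaged over examples."""
    eps = 1e-12
    total = 0.0
    for feats, targets in zip(feats_batch, targets_batch):
        for p, t in zip(head.predict(feats), targets):
            total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(feats_batch)


def train_head(head, feats_batch, targets_batch, lr=0.2, steps=200):
    """Full-batch gradient descent that updates only the head's parameters."""
    n = len(feats_batch)
    for _ in range(steps):
        grad_w = [[0.0] * len(row) for row in head.w]
        grad_b = [0.0] * len(head.b)
        for feats, targets in zip(feats_batch, targets_batch):
            for k, (p, t) in enumerate(zip(head.predict(feats), targets)):
                err = p - t  # d(BCE)/d(logit) for a sigmoid output
                grad_b[k] += err / n
                for j, f in enumerate(feats):
                    grad_w[k][j] += err * f / n
        for k in range(len(head.w)):
            head.b[k] -= lr * grad_b[k]
            for j in range(len(head.w[k])):
                head.w[k][j] -= lr * grad_w[k][j]
    return head
```

In this sketch the backbone features are computed once per input and then passed to the head, which loosely mirrors the idea that the tagger reuses representations the recognizer already produces. Because only `head.w` and `head.b` are ever updated, the stand-in backbone behaves identically before and after training.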