
Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

July 6, 2023
Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass
cs.AI

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. Building on this finding, we construct a unified audio tagging and speech recognition model, Whisper-AT, by freezing the backbone of Whisper and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
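
The "frozen backbone plus lightweight head" idea from the abstract can be illustrated with a toy sketch in plain Python. Everything below is an assumption made for illustration: the backbone is a fixed random projection standing in for the Whisper encoder, the head is a single linear layer over mean-pooled features (the abstract does not specify the actual tagging model), and all sizes, names, and labels are made up. The point is only that training updates the head while the backbone weights stay untouched.

```python
import math
import random

random.seed(0)

N_FEATS, N_HIDDEN, N_EVENTS = 4, 8, 3  # toy sizes, not Whisper's real dimensions


def make_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical stand-in for the frozen ASR encoder: a fixed per-frame projection.
BACKBONE_W = make_matrix(N_HIDDEN, N_FEATS)


def backbone(frames):
    """Map each input frame to a hidden vector; these weights are never updated."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in BACKBONE_W]
            for frame in frames]


def pool(hidden):
    """Mean-pool hidden vectors over time to get one clip-level vector."""
    n = len(hidden)
    return [sum(h[j] for h in hidden) / n for j in range(N_HIDDEN)]


class TaggingHead:
    """Illustrative trainable head: one linear layer with per-event sigmoid outputs."""

    def __init__(self):
        self.w = make_matrix(N_EVENTS, N_HIDDEN)
        self.b = [0.0] * N_EVENTS

    def predict(self, pooled):
        logits = [sum(w * x for w, x in zip(row, pooled)) + b
                  for row, b in zip(self.w, self.b)]
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]

    def train_step(self, pooled, targets, lr=0.2):
        """One gradient step on binary cross-entropy; returns the loss before the step."""
        probs = self.predict(pooled)
        loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                    for p, t in zip(probs, targets))
        for k, (p, t) in enumerate(zip(probs, targets)):
            grad = p - t  # derivative of the loss with respect to logit k
            self.w[k] = [w - lr * grad * x for w, x in zip(self.w[k], pooled)]
            self.b[k] -= lr * grad
        return loss


# One made-up clip: 10 frames of random features with a made-up multi-label target.
frames = [[random.uniform(-1, 1) for _ in range(N_FEATS)] for _ in range(10)]
targets = [1.0, 0.0, 1.0]

head = TaggingHead()
backbone_before = [row[:] for row in BACKBONE_W]

# The backbone runs once per clip; in a unified model the same hidden states
# would also feed the speech decoder, so tagging adds only the head's cost.
pooled = pool(backbone(frames))
losses = [head.train_step(pooled, targets) for _ in range(50)]
```

Because the backbone output is computed once and only the small head is trained and evaluated on top of it, the extra cost of tagging is limited to the head itself, which is the intuition behind the small overhead the abstract reports.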