Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. Building on this finding, we construct Whisper-AT, a unified audio tagging and speech recognition model, by freezing the Whisper backbone and training a lightweight audio tagging model on top of it. With less than 1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
