Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Abstract

In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: although Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. Building on this finding, we construct Whisper-AT, a unified audio tagging and speech recognition model, by freezing the Whisper backbone and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
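The frozen-backbone idea can be illustrated with a toy, dependency-free sketch. This is not Whisper or the Whisper-AT code: a fixed random projection stands in for the frozen backbone, the data and labels are synthetic, and every name and size here (`frozen_backbone`, `TaggingHead`, `FEAT_DIM`, and so on) is a hypothetical choice made for illustration. It only shows the training pattern: the backbone weights are never updated, its features are computed once, and only a small multi-label head is trained on top.

```python
import copy
import math
import random

random.seed(0)

FEAT_DIM, HIDDEN_DIM, NUM_TAGS = 8, 16, 3  # hypothetical toy sizes

# Stand-in for the frozen backbone: a fixed random projection followed by tanh.
# These weights are created once and never updated during training.
BACKBONE_W = [[random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
              for _ in range(HIDDEN_DIM)]

def frozen_backbone(x):
    """Map an input feature vector to a hidden representation (no training)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in BACKBONE_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TaggingHead:
    """Lightweight multi-label head: one logistic unit per tag."""

    def __init__(self, in_dim, num_tags):
        self.w = [[0.0] * in_dim for _ in range(num_tags)]
        self.b = [0.0] * num_tags

    def predict(self, h):
        return [sigmoid(sum(w * v for w, v in zip(row, h)) + b)
                for row, b in zip(self.w, self.b)]

    def sgd_step(self, h, targets, lr=0.1):
        # Gradient of binary cross-entropy w.r.t. each logit is (p - t).
        probs = self.predict(h)
        for k, (p, t) in enumerate(zip(probs, targets)):
            grad = p - t
            self.b[k] -= lr * grad
            for j, v in enumerate(h):
                self.w[k][j] -= lr * grad * v

def bce(probs, targets):
    """Mean binary cross-entropy over the tags of one example."""
    eps = 1e-12
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets)) / len(targets)

# Synthetic data: tag k is "present" when input dimension k is positive.
inputs = [[random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(200)]
labels = [[1.0 if x[k] > 0 else 0.0 for k in range(NUM_TAGS)] for x in inputs]

# Because the backbone is frozen, its features can be computed once and reused.
features = [frozen_backbone(x) for x in inputs]

backbone_snapshot = copy.deepcopy(BACKBONE_W)
head = TaggingHead(HIDDEN_DIM, NUM_TAGS)

def mean_loss():
    return sum(bce(head.predict(h), y)
               for h, y in zip(features, labels)) / len(features)

loss_before = mean_loss()
for _ in range(20):  # a few epochs of plain SGD on the head only
    for h, y in zip(features, labels):
        head.sgd_step(h, y)
loss_after = mean_loss()

print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print("backbone unchanged:", BACKBONE_W == backbone_snapshot)
```

In a real setting the hidden representation would come from the speech model's encoder, and the same representation would also feed the speech decoder, which is what allows tags and text to be produced from one forward pass. The toy head here is far simpler than any practical tagging model.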

