Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Abstract

We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating the robustness of automatic speech recognition (ASR) to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a measured room impulse response from the RIR-Mega corpus. The data are split with stratification by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1,600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 0.12 to 1.07 percentage points depending on the model. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
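To make the pairing and the reverb penalty concrete, the following is a minimal sketch and not the released evaluation code. It assumes that a reverberant sample is the clean waveform convolved with an impulse response and truncated to the clean length, and that the reverb penalty is the reverberant WER minus the clean WER, expressed in percentage points. The function names `reverberate`, `wer`, and `reverb_penalty` are hypothetical and chosen only for illustration.

```python
import numpy as np


def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean waveform with a room impulse response.

    Assumption: the output is truncated to the clean length so that the
    clean and reverberant signals stay aligned sample by sample.
    """
    return np.convolve(clean, rir, mode="full")[: len(clean)]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


def reverb_penalty(wer_clean: float, wer_reverb: float) -> float:
    """Assumed definition: WER difference in percentage points."""
    return 100.0 * (wer_reverb - wer_clean)
```

In practice the clean and reverberant transcripts would come from running the same Whisper model on both signals of a pair; the sketch leaves that step out because it depends on the inference setup.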
