Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

Abstract

Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon is amplified for named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains in the same way, in an unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction by augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8% to 30% relative WER improvements in ID settings and 10% to 33% improvements in OOD settings.
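To make the reported metric concrete, the following is a minimal sketch of how word error rate (WER) and a relative WER improvement are commonly computed. The function names and the example numbers are illustrative assumptions and are not taken from the paper; the paper's own evaluation pipeline (for example, its text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER removed by the improved system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical example: one substituted word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.3f}")

    # Hypothetical numbers: a baseline at 10% WER improved to 8% WER.
    gain = relative_wer_improvement(0.10, 0.08)
    print(f"Relative improvement: {gain:.0%}")
```

Under this definition, a relative improvement is measured against the baseline's WER rather than in absolute percentage points, so a drop from 10% to 8% WER counts as a 20% relative improvement.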
