Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Abstract

In the era of large models, autoregressive decoding often makes latency a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size, ranging from 128M to 340B parameters, on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.
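
To make "per-segment scoring mode" concrete, the following is a minimal sketch of one common way to fuse ASR and LM scores: log-linear rescoring of candidate transcripts for a single audio segment. The abstract does not give the exact fusion formula, so the function names (`rescore_segment`, `toy_lm_log_prob`), the interpolation weight `lm_weight`, and the toy scores are all illustrative assumptions rather than the authors' implementation. The point it illustrates is that the LM only scores complete hypotheses, each independently of the others, so no token-by-token generation is needed and the scoring calls can be batched on an accelerator.

```python
from typing import Callable, Sequence, Tuple


def rescore_segment(
    hypotheses: Sequence[str],
    asr_log_probs: Sequence[float],
    lm_log_prob: Callable[[str], float],
    lm_weight: float = 0.3,  # assumed interpolation weight, not from the paper
) -> Tuple[str, float]:
    """Pick the best hypothesis for one segment by fused score.

    fused = asr_log_prob + lm_weight * lm_log_prob
    """
    if not hypotheses or len(hypotheses) != len(asr_log_probs):
        raise ValueError("need at least one hypothesis and one ASR score per hypothesis")

    # Each hypothesis is scored independently of the others; in a real
    # system these calls would be a single batched forward pass.
    lm_scores = [lm_log_prob(h) for h in hypotheses]

    fused = [a + lm_weight * l for a, l in zip(asr_log_probs, lm_scores)]
    best = max(range(len(hypotheses)), key=fused.__getitem__)
    return hypotheses[best], fused[best]


def toy_lm_log_prob(text: str) -> float:
    """Hypothetical stand-in for an LLM scorer, backed by a lookup table."""
    table = {
        "wreck a nice beach": -9.0,
        "recognize speech": -2.0,
    }
    return table.get(text, -20.0)  # unseen text gets a low score


if __name__ == "__main__":
    candidates = ["wreck a nice beach", "recognize speech"]
    asr_scores = [-1.0, -1.2]  # made-up acoustic log-probabilities
    print(rescore_segment(candidates, asr_scores, toy_lm_log_prob))
```

With `lm_weight=0.0` the acoustically preferred candidate wins; with a positive weight the LM score can overturn that ranking, which is the effect LM fusion relies on.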
