

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

January 23, 2024
作者: W. Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, Tara N. Sainath
cs.AI

Abstract

In the era of large models, the autoregressive nature of decoding often makes latency a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size ranging from 128M to 340B parameters on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.
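To make the per-segment scoring mode more concrete, below is a minimal sketch of N-best rescoring with an external language model. The helper names `asr_nbest` and `llm_log_prob`, and the interpolation weight `lm_weight`, are illustrative assumptions standing in for the USM recognizer and the PaLM 2 scorer; they are not the paper's actual API or tuning.

```python
# Hedged sketch of per-segment LM-fusion rescoring (not the authors' code).
# `asr_nbest` and `llm_log_prob` are hypothetical helpers standing in for the
# first-pass ASR model and the external LLM scorer described in the abstract.

from typing import Callable, List, Tuple


def rescore_segment(
    audio_segment,
    context: str,
    asr_nbest: Callable[[object], List[Tuple[str, float]]],  # -> [(hypothesis, asr_log_prob), ...]
    llm_log_prob: Callable[[str, str], float],                # -> log P_LLM(hypothesis | context)
    lm_weight: float = 0.3,                                   # illustrative fusion weight
) -> str:
    """Pick the hypothesis with the best fused score for one segment."""
    best_hyp, best_score = "", float("-inf")
    for hyp, asr_score in asr_nbest(audio_segment):
        # Per-segment scoring: the LLM scores each full hypothesis in a single
        # call, so hypotheses (and segments) can be scored in parallel rather
        # than token by token.
        fused = asr_score + lm_weight * llm_log_prob(context, hyp)
        if fused > best_score:
            best_hyp, best_score = hyp, fused
    return best_hyp
```

The key design point this sketch illustrates is that, unlike token-level autoregressive shallow fusion, each hypothesis is scored in one pass, which is what lets the system exploit accelerator parallelism.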