MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
August 9, 2024
Authors: Junhao Xu, Zhenlin Liang, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang
cs.AI
Abstract
In this paper, we present MooER, an LLM-based large-scale automatic speech
recognition (ASR) / automatic speech translation (AST) model from Moore Threads.
A 5,000-hour pseudo-labeled dataset containing open-source and self-collected
speech data is used for training. We achieve performance comparable to other
open-source models trained with up to hundreds of thousands of hours of labeled
speech data. Meanwhile, experiments conducted on the CoVoST2 Zh2en test set
suggest that our model outperforms other open-source Speech LLMs, achieving a
BLEU score of 25.2. The main contributions of this paper are summarized as
follows. First, we present a training strategy for encoders and LLMs on
speech-related tasks (including ASR and AST) that uses a small amount of
pseudo-labeled data without any extra manual annotation or selection. Second, we
release our ASR and AST models and plan to open-source our training code and
strategy in the near future. Moreover, a model trained on roughly 80,000 hours
of training data is planned for release later on.
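The encoder-plus-LLM setup named in the contributions follows a common pattern: a speech encoder turns audio into frame-level features, a small trainable adapter projects them into the LLM's embedding space, and the projected features are prepended to the text-prompt embeddings. The sketch below illustrates only that data flow; all module names and dimensions are illustrative assumptions, not MooER's actual configuration.

```python
import numpy as np

ENC_DIM = 512    # speech encoder output dimension (assumed)
LLM_DIM = 4096   # LLM embedding dimension (assumed)

rng = np.random.default_rng(0)
adapter_w = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.01  # trainable projection

def encode_speech(num_frames: int) -> np.ndarray:
    """Stand-in for a (typically frozen) speech encoder: one vector per frame."""
    return rng.standard_normal((num_frames, ENC_DIM))

def build_llm_input(speech_feats: np.ndarray, prompt_embeds: np.ndarray) -> np.ndarray:
    """Project speech features into the LLM embedding space and prepend
    them to the text-prompt embeddings, forming one input sequence."""
    speech_embeds = speech_feats @ adapter_w           # (T, LLM_DIM)
    return np.concatenate([speech_embeds, prompt_embeds], axis=0)

prompt = rng.standard_normal((8, LLM_DIM))             # e.g. a "Transcribe:" prompt
seq = build_llm_input(encode_speech(100), prompt)
print(seq.shape)  # (108, 4096)
```

In this pattern, only the adapter (and optionally the LLM, e.g. via LoRA) needs gradient updates, which is what makes training on a few thousand hours of pseudo-labeled data feasible.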