MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
August 9, 2024
作者: Junhao Xu, Zhenlin Liang, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang
cs.AI
Abstract
In this paper, we present MooER, an LLM-based large-scale automatic speech
recognition (ASR) / automatic speech translation (AST) model from Moore
Threads. A 5,000-hour pseudo-labeled dataset containing open-source and
self-collected speech data is used for training. We achieve performance
comparable to that of other open-source models trained with up to hundreds of
thousands of hours of labeled speech data. Meanwhile, experiments conducted on
the Covost2 Zh2en test set suggest that our model outperforms other
open-source speech LLMs, achieving a BLEU score of 25.2. The main
contributions of this paper are summarized as follows. First, we present a
training strategy for encoders and LLMs on speech-related tasks (including
ASR and AST) using a small amount of pseudo-labeled data without any extra
manual annotation or selection. Second, we release our ASR and AST models and
plan to open-source our training code and strategy in the near future.
Moreover, a model trained on data at the 80,000-hour (8wh) scale is planned
for release later on.