MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
August 9, 2024
作者: Junhao Xu, Zhenlin Liang, Yi Liu, Yichao Hu, Jian Li, Yajun Zheng, Meng Cai, Hua Wang
cs.AI
Abstract
In this paper, we present MooER, an LLM-based large-scale automatic speech
recognition (ASR) / automatic speech translation (AST) model from Moore
Threads. A 5,000-hour pseudo-labeled dataset containing open-source and
self-collected speech data is used for training. We achieve performance
comparable to that of other open-source models trained with up to hundreds of
thousands of hours of labeled speech data. Meanwhile, experiments conducted on
the Covost2 Zh2en test set suggest that our model outperforms other
open-source speech LLMs, achieving a BLEU score of 25.2. The main
contributions of this paper are summarized as follows. First, we present a
training strategy for encoders and LLMs on speech-related tasks (including
ASR and AST) using a small amount of pseudo-labeled data without any extra
manual annotation or selection. Second, we release our ASR and AST models and
plan to open-source our training code and strategy in the near future.
Moreover, a model trained on data at the 80,000-hour (8wh) scale is planned
for release later on.