MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Abstract

In this paper, we present MooER, an LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000h pseudo-labeled dataset containing open-source and self-collected speech data is used for training. We achieve performance comparable to other open-source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments on the CoVoST2 Zh2en test set suggest that our model outperforms other open-source Speech LLMs, obtaining a BLEU score of 25.2. The main contributions of this paper are as follows. First, it presents a training strategy for encoders and LLMs on speech-related tasks (including ASR and AST) using a small amount of pseudo-labeled data, without any extra manual annotation or selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, we plan to release a model trained on 8wh-scale (80,000-hour) training data later on.
