From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Abstract

Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them attractive targets for integrating additional modalities such as images or speech. In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs such as TOWER, because their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, can transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showing that discretized speech input can be integrated as an additional language during LLM adaptation. We make our code and models available to the community.
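The idea of treating discretized speech as an additional language can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the SPIRE code: the `<speech_N>` token format, the helper names, the prompt layout, and the optional step of collapsing repeated units. The unit IDs are assumed to come from some upstream speech discretizer, which is not shown.

```python
from itertools import groupby


def collapse_repeats(units):
    """Merge runs of identical consecutive unit IDs (assumed optional preprocessing)."""
    return [unit for unit, _ in groupby(units)]


def units_to_tokens(units, prefix="<speech_"):
    """Render each discrete unit ID as a pseudo-text token (hypothetical format)."""
    return [f"{prefix}{unit}>" for unit in units]


def speech_vocab(num_units):
    """List the new tokens a text-only vocabulary would need for num_units speech units."""
    return units_to_tokens(range(num_units))


def build_transcription_prompt(units):
    """Format speech tokens like a source-language segment in a translation-style prompt."""
    speech = "".join(units_to_tokens(collapse_repeats(units)))
    return f"Speech: {speech}\nEnglish:"


if __name__ == "__main__":
    example_units = [3, 3, 7, 7, 7, 2, 3]  # made-up unit IDs, not real data
    print(speech_vocab(4))
    print(build_transcription_prompt(example_units))
```

In a setup like this, the tokens from `speech_vocab` would be added to the model's vocabulary before continued pre-training, so that speech sequences can appear in prompts in the same position as text in any other source language.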
