From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
March 13, 2025
Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
cs.AI
Abstract
Large language models (LLMs) have shown remarkable performance and
generalization capabilities across multiple languages and tasks, making them
very attractive targets for multi-modality integration (e.g., images or
speech). In this work, we extend an existing LLM to the speech modality via
speech discretization and continued pre-training. In particular, we are
interested in multilingual LLMs, such as TOWER, as their pre-training setting
allows us to treat discretized speech input as an additional translation
language. The resulting open-source model, SPIRE, is able to transcribe and
translate English speech input while maintaining TOWER's original performance
on translation-related tasks, showcasing that discretized speech input
integration as an additional language is feasible during LLM adaptation. We
make our code and models available to the community.
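The pipeline the abstract describes — discretizing speech into unit IDs and treating the unit sequence as a source "language" in a translation-style prompt — can be sketched roughly as below. This is a minimal illustration, not SPIRE's actual implementation: the nearest-centroid quantizer stands in for k-means over self-supervised speech features, and the `<unit_N>` token format and prompt template are assumptions for illustration only.

```python
import numpy as np

def quantize_features(features, centroids):
    """Map continuous frame features to discrete unit IDs by
    nearest-centroid assignment (a stand-in for k-means quantization
    of self-supervised speech features; SPIRE's exact setup is not
    specified in this abstract)."""
    # Pairwise distances: (num_frames, num_units)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def dedup(units):
    """Collapse consecutive repeated units, a common preprocessing
    step before feeding unit sequences to a language model."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def to_prompt(units, target_lang="German"):
    """Format the unit sequence as a translation-style prompt that
    treats discretized speech as an additional source language.
    The <unit_N> tokens and template are hypothetical."""
    src = " ".join(f"<unit_{u}>" for u in units)
    return f"Translate from speech to {target_lang}:\n{src}\nTranslation:"

# Toy example: 4-unit codebook over 8-dim features, 10 "speech" frames.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 8))
frames = rng.normal(size=(10, 8))
units = dedup(quantize_features(frames, centroids).tolist())
print(to_prompt(units))
```

In the actual model, such unit tokens would be added to the LLM's vocabulary and the model continually pre-trained on paired speech-unit/text data, alongside the original text tasks.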