From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
March 13, 2025
Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
cs.AI
Abstract
Large language models (LLMs) have shown remarkable performance and
generalization capabilities across multiple languages and tasks, making them
very attractive targets for multi-modality integration (e.g., images or
speech). In this work, we extend an existing LLM to the speech modality via
speech discretization and continued pre-training. In particular, we are
interested in multilingual LLMs, such as TOWER, as their pre-training setting
allows us to treat discretized speech input as an additional translation
language. The resulting open-source model, SPIRE, is able to transcribe and
translate English speech input while maintaining TOWER's original performance
on translation-related tasks, showcasing that discretized speech input
integration as an additional language is feasible during LLM adaptation. We
make our code and models available to the community.
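The pipeline the abstract describes — discretizing speech into unit IDs and treating the unit sequence as a source "language" in a translation-style prompt — can be sketched roughly as below. This is a minimal illustration, not SPIRE's actual implementation: the nearest-centroid quantizer stands in for k-means over self-supervised speech features, and the `<unit_N>` token format and prompt template are assumptions for illustration only.

```python
import numpy as np

def quantize_features(features, centroids):
    """Map continuous frame features to discrete unit IDs by
    nearest-centroid assignment (a stand-in for k-means quantization
    of self-supervised speech features; SPIRE's exact setup is not
    specified in this abstract)."""
    # Pairwise distances: (num_frames, num_units)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def dedup(units):
    """Collapse consecutive repeated units, a common preprocessing
    step before feeding unit sequences to a language model."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

def to_prompt(units, target_lang="German"):
    """Format the unit sequence as a translation-style prompt that
    treats discretized speech as an additional source language.
    The <unit_N> tokens and template are hypothetical."""
    src = " ".join(f"<unit_{u}>" for u in units)
    return f"Translate from speech to {target_lang}:\n{src}\nTranslation:"

# Toy example: 4-unit codebook over 8-dim features, 10 "speech" frames.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 8))
frames = rng.normal(size=(10, 8))
units = dedup(quantize_features(frames, centroids).tolist())
print(to_prompt(units))
```

In the actual model, such unit tokens would be added to the LLM's vocabulary and the model continually pre-trained on paired speech-unit/text data, alongside the original text tasks.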