From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
March 13, 2025
Authors: Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, Marcely Zanon Boito, André F. T. Martins
cs.AI
Abstract
Large language models (LLMs) have shown remarkable performance and
generalization capabilities across multiple languages and tasks, making them
very attractive targets for multi-modality integration (e.g., images or
speech). In this work, we extend an existing LLM to the speech modality via
speech discretization and continued pre-training. In particular, we are
interested in multilingual LLMs, such as TOWER, as their pre-training setting
allows us to treat discretized speech input as an additional translation
language. The resulting open-source model, SPIRE, is able to transcribe and
translate English speech input while maintaining TOWER's original performance
on translation-related tasks, showcasing that discretized speech input
integration as an additional language is feasible during LLM adaptation. We
make our code and models available to the community.
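
The abstract's central idea is to map speech into discrete units and feed those units to the LLM as if they were text in an additional "translation language". The following is a minimal, illustrative sketch of that pipeline, not the exact SPIRE recipe: the HuBERT checkpoint, the cluster count, the pseudo-token names, and the prompt template are all assumptions made here for illustration, and the k-means quantizer is fit on a single clip purely to keep the example self-contained.

```python
# Sketch: discretize speech into pseudo-tokens and frame the task as translation.
# Checkpoint, cluster count, token names, and prompt template are illustrative
# assumptions, not the published SPIRE configuration.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# 1) Self-supervised speech encoder (HuBERT is a common choice for unit extraction).
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

# Placeholder audio path; assumes a clip at least a few seconds long.
waveform, sr = torchaudio.load("example.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz

with torch.no_grad():
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    # Real pipelines often cluster an intermediate layer; the last layer is used here for brevity.
    frames = hubert(inputs.input_values).last_hidden_state.squeeze(0)  # (num_frames, hidden_dim)

# 2) Quantize frames into cluster IDs ("speech units"). In practice the k-means
#    model is trained once on a large corpus with a larger vocabulary.
kmeans = KMeans(n_clusters=100, n_init=10).fit(frames.numpy())
unit_ids = kmeans.predict(frames.numpy())

# 3) Collapse consecutive repeats and render units as pseudo-tokens so the
#    sequence can be handled like text in an extra "language".
deduped = [u for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]
speech_text = " ".join(f"<su_{u}>" for u in deduped)

# 4) Hypothetical translation-style prompt treating "Speech" as the source language.
prompt = (
    "Translate the following from Speech into English.\n"
    f"Speech: {speech_text}\nEnglish:"
)
print(prompt[:300])
```

In a full continued-pre-training setup, the unit vocabulary would typically be added to the LLM's tokenizer as new tokens, and speech-text pairs rendered in this translation-style format would be mixed with the model's existing text data; the sketch above only shows how a single utterance could be turned into such an input.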