On decoder-only architecture for speech-to-text and large language model integration
July 8, 2023
Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
cs.AI
Abstract
Large language models (LLMs) have achieved remarkable success in the field of
natural language processing, enabling better human-computer interaction using
natural language. However, the seamless integration of speech signals into LLMs
has not been well explored, nor has the "decoder-only" architecture been well
studied for speech processing tasks. In this research, we introduce
Speech-LLaMA, a novel approach that effectively incorporates acoustic
information into text-based large language models. Our method leverages
Connectionist Temporal Classification and a simple audio encoder to map the
compressed acoustic features to the continuous semantic space of the LLM. In
addition, we further probe the decoder-only architecture for speech-to-text
tasks by training a smaller-scale, randomly initialized Speech-LLaMA model on
speech-text paired data alone. We conduct experiments on multilingual
speech-to-text translation tasks and demonstrate a significant improvement over
strong baselines, highlighting the potential advantages of decoder-only models
for speech-to-text conversion.
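
The pipeline the abstract describes can be illustrated with a short sketch: CTC-driven frame compression followed by a simple audio encoder that projects the compressed frames into the LLM's continuous embedding space, where they are prepended to the text prompt. This is a minimal, hypothetical PyTorch rendering; all module names, dimensions, and the blank-removal heuristic are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of CTC-based frame compression plus an audio-to-LLM
# projector, assuming a PyTorch setting. Names and dimensions are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class CTCCompressor(nn.Module):
    """Drops frames whose CTC posterior argmax is the blank symbol
    (a common CTC-compression heuristic)."""
    def __init__(self, feat_dim: int, vocab_size: int, blank_id: int = 0):
        super().__init__()
        # In practice this head would be trained with a CTC loss.
        self.ctc_head = nn.Linear(feat_dim, vocab_size)
        self.blank_id = blank_id

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, feat_dim) acoustic features from an upstream encoder
        labels = self.ctc_head(frames).argmax(dim=-1)  # (T,) frame labels
        keep = labels != self.blank_id                 # mask out blank frames
        return frames[keep]                            # (T', feat_dim), T' << T

class AudioProjector(nn.Module):
    """Simple audio encoder mapping compressed frames into the
    LLM's embedding space, yielding continuous 'audio tokens'."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, compressed: torch.Tensor) -> torch.Tensor:
        return self.proj(compressed)  # (T', llm_dim)

# Usage: prepend projected audio embeddings to the text prompt embeddings
# and decode with the decoder-only LLM as usual.
feat_dim, vocab_size, llm_dim = 512, 1000, 4096   # hypothetical sizes
compressor = CTCCompressor(feat_dim, vocab_size)
projector = AudioProjector(feat_dim, llm_dim)

frames = torch.randn(300, feat_dim)            # stand-in acoustic features
audio_embeds = projector(compressor(frames))   # (T', llm_dim)
text_embeds = torch.randn(10, llm_dim)         # stand-in prompt embeddings
llm_input = torch.cat([audio_embeds, text_embeds], dim=0)  # fed to the LLM
```

One motivation for compressing before projection: removing blank-dominated frames substantially shortens the audio sequence, which keeps the prepended audio prompt within the decoder-only LLM's context budget.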