On decoder-only architecture for speech-to-text and large language model integration
July 8, 2023
Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
cs.AI
Abstract
Large language models (LLMs) have achieved remarkable success in the field of
natural language processing, enabling better human-computer interaction using
natural language. However, the seamless integration of speech signals into LLMs
has not been well explored, nor has the "decoder-only" architecture been well
studied for speech processing tasks. In this research, we introduce
Speech-LLaMA, a novel approach that effectively incorporates acoustic
information into text-based large language models. Our method leverages
Connectionist Temporal Classification and a simple audio encoder to map the
compressed acoustic features to the continuous semantic space of the LLM. In
addition, we further probe the decoder-only architecture for speech-to-text
tasks by training a smaller-scale, randomly initialized Speech-LLaMA model on
speech-text paired data alone. We conduct experiments on multilingual
speech-to-text translation tasks and demonstrate a significant improvement over
strong baselines, highlighting the potential advantages of decoder-only models
for speech-to-text conversion.
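
The pipeline the abstract describes can be illustrated with a short sketch: CTC-driven frame compression followed by a simple audio encoder that projects the compressed frames into the LLM's continuous embedding space, where they are prepended to the text prompt. This is a minimal, hypothetical PyTorch rendering; all module names, dimensions, and the blank-removal heuristic are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of CTC-based frame compression plus an audio-to-LLM
# projector, assuming a PyTorch setting. Names and dimensions are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class CTCCompressor(nn.Module):
    """Drops frames whose CTC posterior argmax is the blank symbol
    (a common CTC-compression heuristic)."""
    def __init__(self, feat_dim: int, vocab_size: int, blank_id: int = 0):
        super().__init__()
        # In practice this head would be trained with a CTC loss.
        self.ctc_head = nn.Linear(feat_dim, vocab_size)
        self.blank_id = blank_id

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, feat_dim) acoustic features from an upstream encoder
        labels = self.ctc_head(frames).argmax(dim=-1)  # (T,) frame labels
        keep = labels != self.blank_id                 # mask out blank frames
        return frames[keep]                            # (T', feat_dim), T' << T

class AudioProjector(nn.Module):
    """Simple audio encoder mapping compressed frames into the
    LLM's embedding space, yielding continuous 'audio tokens'."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, compressed: torch.Tensor) -> torch.Tensor:
        return self.proj(compressed)  # (T', llm_dim)

# Usage: prepend projected audio embeddings to the text prompt embeddings
# and decode with the decoder-only LLM as usual.
feat_dim, vocab_size, llm_dim = 512, 1000, 4096   # hypothetical sizes
compressor = CTCCompressor(feat_dim, vocab_size)
projector = AudioProjector(feat_dim, llm_dim)

frames = torch.randn(300, feat_dim)            # stand-in acoustic features
audio_embeds = projector(compressor(frames))   # (T', llm_dim)
text_embeds = torch.randn(10, llm_dim)         # stand-in prompt embeddings
llm_input = torch.cat([audio_embeds, text_embeds], dim=0)  # fed to the LLM
```

One motivation for compressing before projection: removing blank-dominated frames substantially shortens the audio sequence, which keeps the prepended audio prompt within the decoder-only LLM's context budget.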