
On decoder-only architecture for speech-to-text and large language model integration

July 8, 2023
Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
cs.AI

Abstract

Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture remains understudied for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model on paired speech-text data alone. Experiments on multilingual speech-to-text translation tasks demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
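
The abstract describes the core mechanism: a CTC-based step compresses the acoustic frame sequence, a simple audio encoder processes it, and a projection maps the result into the LLM's embedding space so the speech acts as a prefix to the text tokens. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation; the module names (SpeechPrefixAdapter, ctc_compress), the layer choices, the blank-probability threshold, and the dimensions are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the Speech-LLaMA idea:
# CTC-compressed acoustic frames are encoded and projected into the
# LLM's embedding space, then prepended to the text-token embeddings
# so a decoder-only LLM attends to speech as a continuous prefix.
import torch
import torch.nn as nn


def ctc_compress(frames: torch.Tensor, ctc_posteriors: torch.Tensor,
                 blank_id: int = 0, threshold: float = 0.95) -> torch.Tensor:
    # Drop frames whose CTC blank probability exceeds a threshold --
    # one plausible compression rule; the paper's exact rule may differ.
    # frames: (T, D), ctc_posteriors: (T, vocab_size)
    keep = ctc_posteriors[:, blank_id] < threshold
    return frames[keep]


class SpeechPrefixAdapter(nn.Module):
    def __init__(self, acoustic_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        # "Simple audio encoder" over the compressed frames; a small
        # Transformer stack is one plausible choice (assumed here).
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=acoustic_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        # Projection into the LLM's continuous semantic (embedding) space.
        self.proj = nn.Linear(acoustic_dim, llm_dim)

    def forward(self, compressed_frames: torch.Tensor) -> torch.Tensor:
        # compressed_frames: (batch, T_compressed, acoustic_dim)
        h = self.audio_encoder(compressed_frames)
        return self.proj(h)  # (batch, T_compressed, llm_dim)


# Usage sketch: compress one utterance, encode and project it, then
# concatenate with text-token embeddings as input to the LLM decoder.
adapter = SpeechPrefixAdapter()
frames = torch.randn(200, 512)                             # raw frames (T, D)
posteriors = torch.softmax(torch.randn(200, 32), dim=-1)   # CTC posteriors
compressed = ctc_compress(frames, posteriors)              # (T', 512)
speech_emb = adapter(compressed.unsqueeze(0))              # (1, T', 4096)
text_emb = torch.randn(1, 20, 4096)                        # token embeddings
llm_input = torch.cat([speech_emb, text_emb], dim=1)       # prefix + text
```

The design point this illustrates is that the LLM itself is unchanged: speech enters only as extra embedding vectors in its input sequence, which is what makes a plain decoder-only architecture sufficient for speech-to-text tasks.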