
On decoder-only architecture for speech-to-text and large language model integration

July 8, 2023
Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
cs.AI

Abstract

Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture remains understudied for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model on paired speech-text data alone. Experiments on multilingual speech-to-text translation tasks demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
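
The abstract describes the core mechanism: a CTC-based step compresses the acoustic frame sequence, a simple audio encoder processes it, and a projection maps the result into the LLM's embedding space so the speech acts as a prefix to the text tokens. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation; the module names (SpeechPrefixAdapter, ctc_compress), the layer choices, the blank-probability threshold, and the dimensions are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the Speech-LLaMA idea:
# CTC-compressed acoustic frames are encoded and projected into the
# LLM's embedding space, then prepended to the text-token embeddings
# so a decoder-only LLM attends to speech as a continuous prefix.
import torch
import torch.nn as nn


def ctc_compress(frames: torch.Tensor, ctc_posteriors: torch.Tensor,
                 blank_id: int = 0, threshold: float = 0.95) -> torch.Tensor:
    # Drop frames whose CTC blank probability exceeds a threshold --
    # one plausible compression rule; the paper's exact rule may differ.
    # frames: (T, D), ctc_posteriors: (T, vocab_size)
    keep = ctc_posteriors[:, blank_id] < threshold
    return frames[keep]


class SpeechPrefixAdapter(nn.Module):
    def __init__(self, acoustic_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        # "Simple audio encoder" over the compressed frames; a small
        # Transformer stack is one plausible choice (assumed here).
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=acoustic_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        # Projection into the LLM's continuous semantic (embedding) space.
        self.proj = nn.Linear(acoustic_dim, llm_dim)

    def forward(self, compressed_frames: torch.Tensor) -> torch.Tensor:
        # compressed_frames: (batch, T_compressed, acoustic_dim)
        h = self.audio_encoder(compressed_frames)
        return self.proj(h)  # (batch, T_compressed, llm_dim)


# Usage sketch: compress one utterance, encode and project it, then
# concatenate with text-token embeddings as input to the LLM decoder.
adapter = SpeechPrefixAdapter()
frames = torch.randn(200, 512)                             # raw frames (T, D)
posteriors = torch.softmax(torch.randn(200, 32), dim=-1)   # CTC posteriors
compressed = ctc_compress(frames, posteriors)              # (T', 512)
speech_emb = adapter(compressed.unsqueeze(0))              # (1, T', 4096)
text_emb = torch.randn(1, 20, 4096)                        # token embeddings
llm_input = torch.cat([speech_emb, text_emb], dim=1)       # prefix + text
```

The design point this illustrates is that the LLM itself is unchanged: speech enters only as extra embedding vectors in its input sequence, which is what makes a plain decoder-only architecture sufficient for speech-to-text tasks.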