On decoder-only architecture for speech-to-text and large language model integration

Abstract

Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture has not been well studied for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
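The abstract's pipeline (compress acoustic features with CTC, then map them into the LLM's continuous space) can be illustrated with a minimal sketch. It assumes one common form of CTC-based compression: consecutive frames that share the same CTC label are averaged, and blank frames are dropped. The abstract does not specify the exact compression rule, so this is not necessarily the configuration used for Speech-LLaMA. The names `ctc_compress` and `project`, the blank index `0`, and the single linear projection standing in for the audio encoder are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def ctc_compress(
    frames: Sequence[Sequence[float]],
    ctc_labels: Sequence[int],
    blank_id: int = 0,  # assumption: index 0 is the CTC blank symbol
) -> List[Vector]:
    """Average each run of frames sharing one CTC label; drop blank runs."""
    if len(frames) != len(ctc_labels):
        raise ValueError("frames and ctc_labels must have the same length")
    compressed: List[Vector] = []
    start = 0
    for t in range(1, len(ctc_labels) + 1):
        # A run ends at the last frame or where the predicted label changes.
        if t == len(ctc_labels) or ctc_labels[t] != ctc_labels[start]:
            if ctc_labels[start] != blank_id:
                run = frames[start:t]
                compressed.append([sum(col) / len(run) for col in zip(*run)])
            start = t
    return compressed


def project(
    vectors: Sequence[Sequence[float]],
    weight: Sequence[Sequence[float]],  # shape: input_dim x llm_dim
    bias: Sequence[float],              # shape: llm_dim
) -> List[Vector]:
    """Linear map from acoustic feature space to the LLM embedding space
    (a stand-in for the audio encoder, for illustration only)."""
    columns = list(zip(*weight))
    return [
        [sum(x * w for x, w in zip(vec, col)) + b for col, b in zip(columns, bias)]
        for vec in vectors
    ]


if __name__ == "__main__":
    frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
    labels = [0, 1, 1, 0, 2, 2]  # per-frame CTC argmax labels
    short = ctc_compress(frames, labels)
    print(len(frames), "frames ->", len(short), "frames")
    print(project(short, [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [0.0, 0.0, 0.5]))
```

In this toy input, six frames shrink to two, which illustrates why compression is useful: the speech sequence handed to the LLM moves closer in length to the text sequence it should produce.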
