On decoder-only architecture for speech-to-text and large language model integration

Abstract

Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs remains underexplored, and the "decoder-only" architecture has not been well studied for speech processing tasks. In this work, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. We further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. Experiments on multilingual speech-to-text translation show a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
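The abstract does not spell out how CTC is used to compress acoustic features, so the following is only a minimal sketch of one plausible reading: frames labeled as CTC blank are dropped, consecutive frames sharing the same non-blank label are averaged, and the result is linearly projected to the LLM embedding width. The names `ctc_compress`, `project`, and `BLANK`, the frame-averaging rule, and the toy dimensions are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_compress(frames, ctc_labels, blank=BLANK):
    """Shorten a frame sequence using per-frame CTC predictions.

    Assumed rule: drop blank frames and average each run of
    consecutive frames that share the same non-blank label.
    """
    if len(frames) != len(ctc_labels):
        raise ValueError("frames and ctc_labels must have the same length")
    compressed = []
    for label, group in groupby(zip(ctc_labels, frames), key=lambda p: p[0]):
        if label == blank:
            continue  # blank frames carry no token and are discarded
        run = [frame for _, frame in group]
        dim = len(run[0])
        compressed.append([sum(f[d] for f in run) / len(run) for d in range(dim)])
    return compressed


def project(vectors, weight, bias):
    """Linear map from encoder width to the (assumed) LLM embedding width.

    weight has one row per output dimension; bias has one entry per row.
    """
    return [
        [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weight, bias)]
        for v in vectors
    ]


if __name__ == "__main__":
    # Six toy encoder frames of width 2 with their argmax CTC labels.
    frames = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [5.0, 5.0], [4.0, 4.0], [6.0, 6.0]]
    labels = [7, 7, 0, 3, 0, 3]

    short = ctc_compress(frames, labels)
    print(short)  # [[2.0, 1.0], [5.0, 5.0], [6.0, 6.0]]

    # Toy projection from width 2 to width 3.
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 0.5]
    print(project(short, weight, bias))
    # [[2.0, 1.0, 3.5], [5.0, 5.0, 10.5], [6.0, 6.0, 12.5]]
```

In this toy run the six input frames shrink to three vectors, which illustrates why such compression would bring the audio sequence length closer to that of text before it reaches the language model.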
