Toward Joint Language Modeling for Speech Units and Text

Abstract

Speech and text are two major forms of human language. For many years, the research community has focused on mapping speech to text or vice versa. In language modeling, however, very little effort has gone into modeling the two jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers for transforming continuous speech signals into discrete units, and we use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint language model (LM) mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess the model's learning of shared representations. Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.
