Textually Pretrained Speech Language Models

Abstract

Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm start from a pretrained textual language model. Using both automatic and human evaluations, we show that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices, such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model scale and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest SpeechLM (to the best of our knowledge) in terms of both number of parameters and amount of training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples can be found on our website: https://pages.cs.huji.ac.il/adiyoss-lab/twist/.
