Toward Joint Language Modeling for Speech Units and Text

Abstract

Speech and text are two major forms of human language. For many years the research community has focused on mapping speech to text or vice versa. In language modeling, however, very little effort has been made to model the two jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers for transforming continuous speech signals into discrete units, and we use different methods to construct mixed speech-text data. We introduce automatic metrics to evaluate how well the joint language model (LM) mixes speech and text. We also fine-tune the LM on downstream spoken language understanding (SLU) tasks with different modalities (speech or text) and test its performance to assess how well the model learns shared representations. Our results show that, by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks and shows zero-shot cross-modal transferability.
