Speechless: Speech Instruction Training Without Speech for Low Resource Languages
May 23, 2025
作者: Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip
cs.AI
Abstract
The rapid growth of voice assistants powered by large language models (LLM)
has highlighted a need for speech instruction data to train these systems.
Despite the abundance of speech recognition data, there is a notable scarcity
of speech instruction data, which is essential for fine-tuning models to
understand and execute spoken commands. Generating high-quality synthetic
speech requires a good text-to-speech (TTS) model, which may not be available
for low-resource languages. Our novel approach addresses this challenge by
halting synthesis at the semantic representation level, bypassing the need for
TTS. We achieve this by aligning synthetic semantic representations with the
pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text
instructions while maintaining the ability to understand spoken instructions
during inference. This simplified training process is a promising approach to
building voice assistants for low-resource languages.