Speechless: Speech Instruction Training Without Speech for Low Resource Languages
May 23, 2025
作者: Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip
cs.AI
Abstract
The rapid growth of voice assistants powered by large language models (LLM)
has highlighted a need for speech instruction data to train these systems.
Despite the abundance of speech recognition data, there is a notable scarcity
of speech instruction data, which is essential for fine-tuning models to
understand and execute spoken commands. Generating high-quality synthetic
speech requires a good text-to-speech (TTS) model, which may not be available
for low-resource languages. Our novel approach addresses this challenge by
halting synthesis at the semantic representation level, bypassing the need for
TTS. We achieve this by aligning synthetic semantic representations with the
pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text
instructions while maintaining the ability to understand spoken instructions
during inference. This simplified training process is a promising approach to
building voice assistants for low-resource languages.