Speechless: Speech Instruction Training Without Speech for Low Resource Languages
May 23, 2025
作者: Alan Dao, Dinh Bach Vu, Huy Hoang Ha, Tuan Le Duc Anh, Shreyas Gopal, Yue Heng Yeo, Warren Keng Hoong Low, Eng Siong Chng, Jia Qi Yip
cs.AI
Abstract
The rapid growth of voice assistants powered by large language models (LLM)
has highlighted a need for speech instruction data to train these systems.
Despite the abundance of speech recognition data, there is a notable scarcity
of speech instruction data, which is essential for fine-tuning models to
understand and execute spoken commands. Generating high-quality synthetic
speech requires a good text-to-speech (TTS) model, which may not be available
to low resource languages. Our novel approach addresses this challenge by
halting synthesis at the semantic representation level, bypassing the need for
TTS. We achieve this by aligning synthetic semantic representations with the
pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text
instructions while maintaining the ability to understand spoken instructions
during inference. This simplified training process is a promising approach to
building voice assistants for low-resource languages.
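To make the idea concrete, below is a minimal sketch (not the authors' implementation) of the training setup the abstract describes: a text-to-semantic generator is trained to imitate a frozen Whisper encoder's output, and an adapter projects those (real or synthetic) semantic states into the LLM's embedding space. All module names, hidden sizes, and the simple MSE alignment loss are assumptions made for illustration.

```python
# Sketch of "halting synthesis at the semantic representation level":
# instead of synthesizing audio with TTS, generate pseudo Whisper-encoder
# states directly from text and align them with the real encoder's output.
# Module names and dimensions below are illustrative assumptions.

import torch
import torch.nn as nn

D_SEM = 768   # assumed Whisper encoder hidden size
D_LLM = 2048  # assumed LLM hidden size

class TextToSemantic(nn.Module):
    """Maps instruction token IDs to a sequence of synthetic semantic
    representations meant to imitate Whisper encoder states."""
    def __init__(self, vocab_size=32000, d_model=D_SEM, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids):                    # (B, T_text)
        return self.encoder(self.embed(token_ids))   # (B, T_text, D_SEM)

class SpeechAdapter(nn.Module):
    """Projects semantic states (synthetic or from the real Whisper encoder)
    into the LLM embedding space, to be used as a prefix."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_SEM, D_LLM)

    def forward(self, semantic_states):               # (B, T, D_SEM)
        return self.proj(semantic_states)              # (B, T, D_LLM)

# Stage 1 (alignment): match the generator's outputs to the frozen Whisper
# encoder's outputs on paired ASR data. Real systems must handle differing
# sequence lengths; here we assume they already match.
def alignment_loss(synthetic_states, whisper_states):
    return nn.functional.mse_loss(synthetic_states, whisper_states)

# Stage 2 (instruction tuning): prepend adapted synthetic states to the LLM
# input and fine-tune on text-only instruction data. At inference, spoken
# input is encoded by the real Whisper encoder and fed through the same adapter.
if __name__ == "__main__":
    t2s, adapter = TextToSemantic(), SpeechAdapter()
    fake_tokens = torch.randint(0, 32000, (2, 16))
    prefix = adapter(t2s(fake_tokens))   # (2, 16, D_LLM), prepended to LLM embeddings
    print(prefix.shape)
```

Because the adapter only ever sees semantic-level representations, swapping synthetic states for real Whisper encoder outputs at inference requires no TTS at any point, which is what makes the recipe attractive for low-resource languages.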