
Distilling an End-to-End Voice Assistant Without Instruction Training Data

October 3, 2024
Authors: William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang
cs.AI

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.
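The self-supervision described above can be sketched as a token-level distillation objective: the student (the Speech LLM reading audio) is trained to match the next-token distribution that a frozen text-only teacher produces when given the transcript, so no annotated responses are required. A minimal NumPy sketch follows; the function name, logit shapes, and use of plain KL divergence are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kl_distillation_loss(student_logits, teacher_logits):
    """Mean token-level KL(teacher || student).

    student_logits: logits from the speech LLM conditioned on audio
    teacher_logits: logits from the frozen text-only LLM on the transcript
    Both arrays have shape (num_tokens, vocab_size).
    This is an illustrative sketch of distillation without labels,
    not the implementation used in the paper.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(teacher_logits)  # teacher distribution (supervision signal)
    q = softmax(student_logits)  # student distribution to be pulled toward p
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

When the student's logits equal the teacher's, the loss is zero; any divergence between the two distributions yields a positive loss, giving a training signal derived purely from the text model's own responses.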
