

Distilling an End-to-End Voice Assistant Without Instruction Training Data

October 3, 2024
作者: William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang
cs.AI

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.
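The core idea of the paradigm described above is distillation without instruction data: the speech model's output distribution on raw audio is pushed toward a text-only LLM's distribution on the corresponding transcript. A minimal conceptual sketch in plain Python, using toy next-token distributions (the variable names and the specific KL(teacher || student) objective here are illustrative assumptions, not the paper's exact implementation):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token vocabulary.
teacher = [0.7, 0.1, 0.1, 0.1]  # text-only LLM conditioned on the transcript
student = [0.4, 0.3, 0.2, 0.1]  # speech LLM conditioned on the raw audio

# The distillation objective minimizes the divergence from the teacher,
# requiring no human-annotated responses -- only paired audio/transcripts.
loss = kl_divergence(teacher, student)
```

Because the supervision signal is the teacher's own response distribution, no instruction-tuning labels are needed, which is what lets this process run on unannotated speech.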

