Distilling an End-to-End Voice Assistant Without Instruction Training Data

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using more than 100x less training compute.
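The abstract does not state which distillation objective is used. As a minimal sketch of the general idea (a text-only LLM's response to a transcript supervising a speech model), the example below assumes a token-level KL divergence between the teacher's and the student's next-token distributions. The function names, the toy logits, and the choice of KL divergence are all illustrative assumptions, not the paper's stated method.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def kl_divergence(teacher_logits: Sequence[float],
                  student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_seq: Sequence[Sequence[float]],
                      student_seq: Sequence[Sequence[float]]) -> float:
    """Mean per-position KL divergence over a response.

    teacher_seq: logits from a (hypothetical) text-only LLM given the transcript.
    student_seq: logits from a (hypothetical) speech model given the audio.
    No annotated response is needed: the teacher's output is the target.
    """
    if not teacher_seq or len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    per_position = [kl_divergence(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_position) / len(per_position)


# Toy example: 2 token positions, vocabulary of size 3 (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 0.7, -0.5], [0.0, 1.0, 0.8]]
loss = distillation_loss(teacher, student)
```

In this sketch the loss is zero when the student reproduces the teacher's distributions exactly and grows as they diverge, which is what lets the teacher's own outputs stand in for annotated responses.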

