Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
December 6, 2023
Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
cs.AI
Abstract
Interactions with virtual assistants typically start with a trigger phrase
followed by a command. In this work, we explore the possibility of making these
interactions more natural by eliminating the need for a trigger phrase. Our
goal is to determine whether a user addressed the virtual assistant based on
signals obtained from the streaming audio recorded by the device microphone. We
address this task by combining 1-best hypotheses and decoder signals from an
automatic speech recognition system with acoustic representations from an audio
encoder as input features to a large language model (LLM). In particular, we
are interested in data- and resource-efficient systems that require only a small
amount of training data and can operate in scenarios with only a single frozen
LLM available on a device. For this reason, our model is trained on 80k or fewer
examples of multimodal data using a combination of low-rank adaptation and
prefix tuning. We compare the proposed system to unimodal baselines and show
that the multimodal approach achieves lower equal-error-rates (EERs), while
using only a fraction of the training data. We also show that low-dimensional
specialized audio representations lead to lower EERs than high-dimensional
general audio representations.
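
The following is a minimal sketch, in PyTorch, of how such a system could be wired together. It is not the authors' implementation: the class names, dimensions, and the Hugging Face-style `inputs_embeds` interface are illustrative assumptions. It combines the trainable pieces the abstract names around a frozen LLM: a projection of audio-encoder features into the LLM embedding space, a learned soft prefix (prefix tuning), a low-rank adapter (LoRA) that can wrap the frozen model's linear projections, and a small classification head.

```python
# Hypothetical sketch of the multimodal device-directedness classifier.
# Assumptions: an HF-style decoder-only LLM accepting `inputs_embeds`,
# audio-encoder features of shape (B, T_a, audio_dim), and tokenized ASR
# 1-best hypotheses (decoder signals could be serialized into the same
# text stream). None of the names below come from the paper.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init:
        self.scale = alpha / rank                # training starts from base behavior

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

class DirectednessModel(nn.Module):
    """Soft prefix + projected audio features + embedded ASR text -> frozen LLM."""
    def __init__(self, llm: nn.Module, token_embedding: nn.Embedding,
                 audio_dim: int, hidden_dim: int, prefix_len: int = 16):
        super().__init__()
        self.llm = llm                           # frozen LLM body; its linear
        # projections could be wrapped with LoRALinear, e.g. (hypothetical path):
        # layer.attn.q_proj = LoRALinear(layer.attn.q_proj)
        self.embed = token_embedding             # frozen token embedding table
        # trainable pieces: audio projection, soft prefix, classification head
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats: torch.Tensor, asr_ids: torch.Tensor):
        audio = self.audio_proj(audio_feats)                       # (B, T_a, H)
        text = self.embed(asr_ids)                                 # (B, T_t, H)
        prefix = self.prefix.unsqueeze(0).expand(audio.size(0), -1, -1)
        inputs = torch.cat([prefix, audio, text], dim=1)           # (B, L, H)
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        return self.head(hidden[:, -1, :]).squeeze(-1)             # one logit each
```

Only the audio projection, soft prefix, LoRA factors, and head carry gradients; keeping the LLM itself frozen is what makes the approach viable with 80k or fewer training examples and a single shared on-device model.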
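
For reference, the equal-error-rate used as the evaluation metric is the operating point at which the false-accept rate equals the false-reject rate. A small sketch using scikit-learn's ROC utilities (an assumption about tooling, not the paper's evaluation code):

```python
# Hypothetical EER computation: find the threshold where FPR ~= FNR.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores)  # labels in {0,1}, higher score = directed
    fnr = 1.0 - tpr                          # false-reject rate
    idx = int(np.nanargmin(np.abs(fnr - fpr)))
    return float((fpr[idx] + fnr[idx]) / 2.0)
```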