Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models
December 6, 2023
Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi
cs.AI
Abstract
Interactions with virtual assistants typically start with a trigger phrase
followed by a command. In this work, we explore the possibility of making these
interactions more natural by eliminating the need for a trigger phrase. Our
goal is to determine whether a user addressed the virtual assistant based on
signals obtained from the streaming audio recorded by the device microphone. We
address this task by combining 1-best hypotheses and decoder signals from an
automatic speech recognition system with acoustic representations from an audio
encoder as input features to a large language model (LLM). In particular, we
are interested in data- and resource-efficient systems that require only a small
amount of training data and can operate in scenarios where only a single frozen
LLM is available on a device. For this reason, our model is trained on 80k or fewer
examples of multimodal data using a combination of low-rank adaptation and
prefix tuning. We compare the proposed system to unimodal baselines and show
that the multimodal approach achieves lower equal error rates (EERs), while
using only a fraction of the training data. We also show that low-dimensional
specialized audio representations lead to lower EERs than high-dimensional
general audio representations.
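The fusion described in the abstract (ASR 1-best text, decoder signals, and acoustic representations fed jointly to a frozen LLM) can be pictured with a short sketch. This is a minimal, hypothetical illustration assuming simple linear projections into the LLM embedding space followed by sequence-level concatenation; the module names, dimensions, and concatenation order are assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class MultimodalPrefixFusion(nn.Module):
    """Projects audio-encoder features and ASR decoder signals into the
    frozen LLM's embedding space and prepends them to the token
    embeddings of the 1-best hypothesis (hypothetical layout)."""

    def __init__(self, audio_dim: int, signal_dim: int, llm_dim: int):
        super().__init__()
        # Only these small projections are trainable; the LLM stays frozen.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.signal_proj = nn.Linear(signal_dim, llm_dim)

    def forward(
        self,
        audio_feats: torch.Tensor,      # (batch, T_audio, audio_dim)
        decoder_signals: torch.Tensor,  # (batch, signal_dim), e.g. ASR confidences
        text_embeds: torch.Tensor,      # (batch, T_text, llm_dim), embedded 1-best
    ) -> torch.Tensor:
        audio_tokens = self.audio_proj(audio_feats)                # (B, T_audio, llm_dim)
        signal_token = self.signal_proj(decoder_signals)[:, None]  # (B, 1, llm_dim)
        # Sequence-level concatenation: [audio | decoder signals | text].
        return torch.cat([audio_tokens, signal_token, text_embeds], dim=1)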
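Parameter efficiency comes from training only small adapter weights while the LLM itself stays frozen. Below is a self-contained sketch of low-rank adaptation combined with a learned soft prefix. Note that full prefix tuning inserts trainable prefixes at every attention layer; this simplified input-level variant, and all class and hyperparameter names, are illustrative rather than the paper's implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the original weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero (identity) update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class SoftPrefix(nn.Module):
    """Trainable prefix vectors prepended to the input embeddings
    (a simplified stand-in for layer-wise prefix tuning)."""

    def __init__(self, n_prefix: int, llm_dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(n_prefix, llm_dim) * 0.02)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        batch = embeds.size(0)
        return torch.cat([self.prefix.expand(batch, -1, -1), embeds], dim=1)

Because the base weights are frozen and lora_b is zero-initialized, training starts from the unmodified LLM and only a small number of parameters are updated, which is one reason such adapters remain trainable on datasets of 80k examples or fewer.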
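The reported metric is the equal error rate: the operating point at which the false-accept and false-reject rates coincide. A standard way to compute it from detection scores uses the ROC curve, as sketched below; the function name and the random labels/scores are illustrative only.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores)  # false-/true-positive rates per threshold
    fnr = 1.0 - tpr                          # false-negative (miss) rate
    idx = np.nanargmin(np.abs(fnr - fpr))    # threshold where FPR is closest to FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Usage with dummy data (random scoring yields an EER near 0.5):
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = rng.random(1000)
print(f"EER: {equal_error_rate(labels, scores):.3f}")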