Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Abstract

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant, based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios where only a single frozen LLM is available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
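The equal error rate used as the comparison metric is the operating point at which the false accept rate (non-directed speech accepted as device-directed) equals the false reject rate (device-directed speech rejected). The following is a minimal sketch of how such a value can be approximated from detector scores; it is not the authors' evaluation code, and the function name `equal_error_rate`, the score convention (higher means more likely device-directed), and the threshold sweep are assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a binary detector (illustrative sketch).

    scores: detector outputs; higher is assumed to mean "device-directed".
    labels: 1 for device-directed utterances, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, eer = None, None
    # Sweep every observed score as a candidate decision threshold,
    # plus +inf so that "reject everything" is also considered.
    for threshold in sorted(set(scores)) + [float("inf")]:
        # False accepts: non-directed speech scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejects: directed speech scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With a finite set of scores the two error rates may never be exactly equal, so this sketch returns their average at the threshold where they are closest. A perfectly separable score set gives an EER of 0, and lower values indicate a better detector.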
