Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Abstract

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios where only a single frozen LLM is available on the device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
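
For readers unfamiliar with the evaluation metric, the equal error rate is the operating point at which the false accept rate (non-directed speech accepted as directed) equals the false reject rate (directed speech rejected). The following is a minimal sketch of how an EER could be computed from detector scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are assumptions made for illustration only.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely device-directed" (assumed convention).
    labels: 1 = device-directed, 0 = not device-directed.
    Returns (eer, threshold) at the point where the false accept rate and
    false reject rate are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Candidate thresholds: every distinct score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    best = None
    for t in thresholds:
        accepted = scores >= t
        far = np.sum(accepted & (labels == 0)) / n_neg   # false accept rate
        frr = np.sum(~accepted & (labels == 1)) / n_pos  # false reject rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, float(t))
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the paper.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
    eer, threshold = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.2f} at threshold {threshold:.1f}")
```

A threshold sweep over the observed scores keeps the sketch dependency-light; with many trials, interpolating between adjacent operating points on the ROC curve would give a smoother estimate.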
