Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

Abstract

Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without loss of speech information. Additionally, using CTC-based blank filtering, we can reduce the speech sequence length to that of the text. On the speech MultiWoz dataset (DSTC11 challenge), SLM substantially improves dialog state tracking (DST) performance (from 24.7% to 28.4% accuracy). Furthermore, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves ASR performance from 9.4% to 8.5% WER.
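
The abstract does not spell out how CTC-based blank filtering is implemented, so the following is only a minimal sketch of the general idea under stated assumptions: a CTC head produces per-frame logits, the blank symbol has index 0, and a frame is dropped when its greedy (argmax) prediction is blank. The function name `ctc_blank_filter` and the constant `BLANK_ID` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: the CTC blank symbol uses index 0


def ctc_blank_filter(
    frames: Sequence[Sequence[float]],
    ctc_logits: Sequence[Sequence[float]],
    blank_id: int = BLANK_ID,
) -> List[Sequence[float]]:
    """Keep only the frames whose greedy CTC prediction is not blank.

    frames:     one feature vector per speech frame (length T)
    ctc_logits: one vector of vocabulary scores per frame (length T)
    """
    if len(frames) != len(ctc_logits):
        raise ValueError("frames and ctc_logits must have the same length")

    kept = []
    for frame, logits in zip(frames, ctc_logits):
        # Greedy decision: index of the highest-scoring symbol for this frame.
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != blank_id:
            kept.append(frame)
    return kept


# Toy example: 4 frames, vocabulary of 3 symbols (index 0 is blank).
frames = [[0.1], [0.2], [0.3], [0.4]]
ctc_logits = [
    [2.0, 0.1, 0.3],  # argmax 0 (blank), dropped
    [0.1, 3.0, 0.2],  # argmax 1, kept
    [5.0, 0.0, 0.0],  # argmax 0 (blank), dropped
    [0.0, 0.1, 4.0],  # argmax 2, kept
]
filtered = ctc_blank_filter(frames, ctc_logits)
print(len(frames), "->", len(filtered))
```

Because most frames in a CTC model are typically predicted as blank, dropping them shortens the sequence toward the number of emitted tokens, which is the effect the abstract describes. Whether the actual system also merges repeated non-blank frames is not stated in the abstract, and this sketch does not do so.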
