A Speech-to-Text adapter and a Speech-to-Entity retriever give augmented LLMs speech understanding capabilities.
Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding
June 8, 2023
Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
cs.AI
Abstract
Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without loss of speech information. Additionally, using CTC-based blank filtering, we can reduce the speech sequence length to that of the text. On the speech MultiWoz dataset (DSTC11 challenge), SLM largely improves dialog state tracking (DST) performance (from 24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves ASR performance from 9.4% to 8.5% WER.
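
The abstract mentions two concrete mechanisms: CTC-based blank filtering to shorten the speech encoder output to roughly text length, and prefixing retrieved entities to the model input (ReSLM). Below is a minimal sketch of both ideas, not the authors' implementation; the function names, tensor shapes, and the use of random arrays in place of real encoder outputs are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's code) of CTC blank filtering and
# entity-prefix concatenation as described in the abstract.
import numpy as np

def ctc_blank_filter(speech_frames: np.ndarray,
                     ctc_logits: np.ndarray,
                     blank_id: int = 0) -> np.ndarray:
    """Keep only encoder frames whose CTC argmax is a non-blank label.

    speech_frames: (T, D) frame-level speech encoder outputs
    ctc_logits:    (T, V) per-frame CTC logits over the vocabulary
    Returns a (T', D) array with T' roughly matching the text length.
    """
    keep = ctc_logits.argmax(axis=-1) != blank_id
    return speech_frames[keep]

def build_reslm_input(retrieved_entity_embeds: np.ndarray,
                      filtered_speech_embeds: np.ndarray,
                      text_embeds: np.ndarray) -> np.ndarray:
    """Concatenate retrieved-entity embeddings as a prefix to the
    (speech + text) input sequence, mirroring the ReSLM description."""
    return np.concatenate(
        [retrieved_entity_embeds, filtered_speech_embeds, text_embeds],
        axis=0)

# Toy usage with random tensors standing in for real model outputs.
T, D, V = 50, 16, 32
frames = np.random.randn(T, D)
logits = np.random.randn(T, V)
speech = ctc_blank_filter(frames, logits, blank_id=0)
entities = np.random.randn(3, D)   # e.g. embeddings of 3 retrieved entities
history = np.random.randn(10, D)   # embeddings of the dialog-history text
llm_input = build_reslm_input(entities, speech, history)
print(speech.shape, llm_input.shape)
```

The key point the sketch captures is that blank filtering is a hard selection over frames driven by the CTC posterior, so the downstream LLM sees a speech sequence of comparable length to its text inputs, with retrieved entities simply prepended as extra context.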