Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding
June 8, 2023
Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
cs.AI
Abstract
Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without loss of speech information. Additionally, using CTC-based blank-filtering, we can reduce the speech sequence length to that of text. On the speech MultiWoz dataset (DSTC11 challenge), SLM largely improves dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves ASR performance from 9.4% to 8.5% WER.
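The CTC-based blank-filtering mentioned in the abstract can be illustrated with a minimal sketch (not the authors' code; the function name and toy data are hypothetical): a CTC head emits mostly blank tokens over the speech frames, so dropping every frame whose argmax prediction is the blank shortens the encoder output toward the length of the text sequence.

```python
import numpy as np

def blank_filter(frames, ctc_logits, blank_id=0):
    """Keep only frames whose CTC argmax is a non-blank token.

    frames: (T, D) encoder outputs; ctc_logits: (T, V) per-frame CTC logits.
    Returns the filtered frames, shortening the speech sequence toward
    the length of the corresponding text.
    """
    keep = ctc_logits.argmax(axis=-1) != blank_id
    return frames[keep]

# Toy example: 6 frames, vocabulary of 3 tokens (id 0 = blank).
frames = np.arange(12, dtype=float).reshape(6, 2)
logits = np.array([
    [5.0, 0.0, 0.0],  # blank
    [0.0, 5.0, 0.0],  # token 1
    [5.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 5.0],  # token 2
    [5.0, 0.0, 0.0],  # blank
    [0.0, 5.0, 0.0],  # token 1
])
filtered = blank_filter(frames, logits)
print(filtered.shape)  # (3, 2): half the frames were blanks
```

In the paper's setting the surviving frames are what the adapter maps into the text token embedding space; this sketch only shows the length-reduction step.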