Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding
June 8, 2023
Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey
cs.AI
Abstract
Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without loss of speech information. Additionally, using CTC-based blank-filtering, we can reduce the speech sequence length to that of text. On the speech MultiWoz dataset (DSTC11 challenge), SLM largely improves dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves ASR performance from 9.4% to 8.5% WER.
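The CTC-based blank-filtering mentioned in the abstract can be illustrated with a minimal sketch (not the authors' code; the function name and toy data are hypothetical): a CTC head emits mostly blank tokens over the speech frames, so dropping every frame whose argmax prediction is the blank shortens the encoder output toward the length of the text sequence.

```python
import numpy as np

def blank_filter(frames, ctc_logits, blank_id=0):
    """Keep only frames whose CTC argmax is a non-blank token.

    frames: (T, D) encoder outputs; ctc_logits: (T, V) per-frame CTC logits.
    Returns the filtered frames, shortening the speech sequence toward
    the length of the corresponding text.
    """
    keep = ctc_logits.argmax(axis=-1) != blank_id
    return frames[keep]

# Toy example: 6 frames, vocabulary of 3 tokens (id 0 = blank).
frames = np.arange(12, dtype=float).reshape(6, 2)
logits = np.array([
    [5.0, 0.0, 0.0],  # blank
    [0.0, 5.0, 0.0],  # token 1
    [5.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 5.0],  # token 2
    [5.0, 0.0, 0.0],  # blank
    [0.0, 5.0, 0.0],  # token 1
])
filtered = blank_filter(frames, logits)
print(filtered.shape)  # (3, 2): half the frames were blanks
```

In the paper's setting the surviving frames are what the adapter maps into the text token embedding space; this sketch only shows the length-reduction step.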