

Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

March 27, 2026
Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
cs.AI

Abstract

Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
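The core idea of Abstract Compression, replacing a prior turn's variable-length audio token sequence with a fixed number of learned latent tokens, can be sketched as a cross-attention pooling step. The sketch below is a minimal illustrative assumption, not the paper's implementation: a single-head attention in NumPy where learned latent queries attend over the audio frames, so the output size is constant regardless of how long the turn was.

```python
import numpy as np

def abstract_compress(audio_frames, latent_queries):
    """Pool a variable-length audio sequence (T, d) into a fixed set of
    latent tokens (K, d) via single-head cross-attention.
    Hypothetical sketch; in practice the queries would be trained."""
    d = latent_queries.shape[-1]
    scores = latent_queries @ audio_frames.T / np.sqrt(d)   # (K, T)
    scores -= scores.max(axis=-1, keepdims=True)            # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)          # attention weights over frames
    return weights @ audio_frames                           # (K, d), independent of T

rng = np.random.default_rng(0)
K, d = 8, 16                                   # K latent tokens of dimension d (illustrative sizes)
latents = rng.normal(size=(K, d))
short_turn = rng.normal(size=(50, d))          # a short prior turn
long_turn = rng.normal(size=(500, d))          # a 10x longer prior turn
print(abstract_compress(short_turn, latents).shape)  # (8, 16)
print(abstract_compress(long_turn, latents).shape)   # (8, 16)
```

Whatever the conversation length, each prior turn contributes only K audio-side tokens to the LLM context, while its transcript is kept as explicit text; this is what bounds the prior-turn audio footprint.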