Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
March 27, 2026
Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
cs.AI
Abstract
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while explicitly retaining the corresponding transcripts. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains from raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
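To make the core idea concrete, below is a minimal sketch of how the abstract describes Abstract Compression: a variable-length sequence of prior-turn audio embeddings is squeezed into a fixed number of learned latent tokens via cross-attention, while prior-turn transcripts are kept as ordinary text tokens. The module name, hyperparameters, and cross-attention mechanism (`AbstractCompressor`, `num_latents`, a Perceiver/Q-Former-style resampler) are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch: compress prior-turn audio into a fixed number of latent tokens.
# All names and design choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class AbstractCompressor(nn.Module):
    """Compress a variable-length audio embedding sequence into `num_latents` tokens."""

    def __init__(self, d_model: int = 1024, num_latents: int = 16, num_heads: int = 8):
        super().__init__()
        # Learned latent queries shared across all prior turns (assumption).
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_embeddings: torch.Tensor) -> torch.Tensor:
        # audio_embeddings: (batch, T_audio, d_model); T_audio grows with turn length.
        batch = audio_embeddings.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Latent queries attend to the raw audio tokens; the output length is
        # fixed at num_latents regardless of how long the prior turn was.
        compressed, _ = self.cross_attn(queries, audio_embeddings, audio_embeddings)
        return self.norm(compressed)  # (batch, num_latents, d_model)


if __name__ == "__main__":
    compressor = AbstractCompressor(d_model=1024, num_latents=16)
    prior_turn_audio = torch.randn(1, 750, 1024)  # e.g. a long prior-turn audio sequence
    context_audio = compressor(prior_turn_audio)  # (1, 16, 1024), fixed size
    print(context_audio.shape)
```

In such a setup, each prior turn would contribute its 16 compressed latent tokens plus its explicit transcript tokens to the conversational context fed to the LLM decoder alongside the current-turn audio, so the context cost per prior turn stays roughly constant instead of growing with audio length.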