Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
March 27, 2026
Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
cs.AI
Abstract
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while explicitly retaining the corresponding transcripts. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains from raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
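To make the core idea concrete, below is a minimal sketch of how the abstract describes Abstract Compression: a variable-length sequence of prior-turn audio embeddings is squeezed into a fixed number of learned latent tokens via cross-attention, while prior-turn transcripts are kept as ordinary text tokens. The module name, hyperparameters, and cross-attention mechanism (`AbstractCompressor`, `num_latents`, a Perceiver/Q-Former-style resampler) are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch: compress prior-turn audio into a fixed number of latent tokens.
# All names and design choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class AbstractCompressor(nn.Module):
    """Compress a variable-length audio embedding sequence into `num_latents` tokens."""

    def __init__(self, d_model: int = 1024, num_latents: int = 16, num_heads: int = 8):
        super().__init__()
        # Learned latent queries shared across all prior turns (assumption).
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_embeddings: torch.Tensor) -> torch.Tensor:
        # audio_embeddings: (batch, T_audio, d_model); T_audio grows with turn length.
        batch = audio_embeddings.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Latent queries attend to the raw audio tokens; the output length is
        # fixed at num_latents regardless of how long the prior turn was.
        compressed, _ = self.cross_attn(queries, audio_embeddings, audio_embeddings)
        return self.norm(compressed)  # (batch, num_latents, d_model)


if __name__ == "__main__":
    compressor = AbstractCompressor(d_model=1024, num_latents=16)
    prior_turn_audio = torch.randn(1, 750, 1024)  # e.g. a long prior-turn audio sequence
    context_audio = compressor(prior_turn_audio)  # (1, 16, 1024), fixed size
    print(context_audio.shape)
```

In such a setup, each prior turn would contribute its 16 compressed latent tokens plus its explicit transcript tokens to the conversational context fed to the LLM decoder alongside the current-turn audio, so the context cost per prior turn stays roughly constant instead of growing with audio length.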