
Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

November 10, 2025
Authors: Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic
cs.AI

Abstract

Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models that raise computational and deployment resource use while missing potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy with efficiency. These limitations highlight the need for a unified framework that can support ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to efficiently train across multiple audio and visual granularities, reducing its inherent training resource use. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines while training a single model at substantially lower training and deployment resource use. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
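The abstract names two mechanisms: matryoshka-style training across multiple audio/visual token granularities, and LoRA-based adaptation of a frozen backbone LLM. The paper's actual architecture is not shown here, so the following is only a minimal sketch of those two ideas under stated assumptions: `LoRALinear`, `compress_tokens`, and `matryoshka_step` are hypothetical names, compression is illustrated as simple average pooling, and the multi-granularity loss is illustrated as a mean over compression rates.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: frozen base linear layer plus a trainable
    low-rank update (B @ A), scaled by alpha / rank."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: starts as identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def compress_tokens(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Fixed-rate compression stand-in: average-pool every `rate`
    consecutive tokens along the time axis."""
    b, t, d = tokens.shape
    t_trim = (t // rate) * rate  # drop any trailing remainder
    return tokens[:, :t_trim].reshape(b, t_trim // rate, rate, d).mean(dim=2)

def matryoshka_step(model, tokens, loss_fn, rates=(1, 2, 4)):
    """One matryoshka-style step: evaluate the same model at several
    token compression rates and average the losses, so a single model
    learns to operate at every granularity (enabling elastic inference)."""
    total = 0.0
    for r in rates:
        total = total + loss_fn(model(compress_tokens(tokens, r)))
    return total / len(rates)
```

At inference, one compression rate is chosen to trade accuracy against token count, which is the "elastic inference" property the abstract refers to; the real model additionally balances shared vs. task-specific LoRA modules across ASR, VSR, and AVSR, which this sketch omits.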