Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models
November 10, 2025
Authors: Umberto Cappellazzo, Xubo Liu, Pingchuan Ma, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Large language models (LLMs) have recently achieved impressive results in
speech recognition across multiple modalities, including Auditory Speech
Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech
Recognition (AVSR). Despite this progress, current LLM-based approaches
typically address each task independently, training separate models that
increase computational and deployment costs while missing potential cross-task
synergies. They also rely on fixed-rate token compression, which restricts
flexibility in balancing accuracy against efficiency. These limitations highlight
the need for a unified framework that can support ASR, VSR, and AVSR while
enabling elastic inference. To this end, we present Omni-AVSR, a unified
audio-visual LLM that combines efficient multi-granularity training with
parameter-efficient adaptation. Specifically, we adapt the matryoshka
representation learning paradigm to efficiently train across multiple audio and
visual granularities, reducing its inherent training resource use. Furthermore,
we explore three LoRA-based strategies for adapting the backbone LLM, balancing
shared and task-specific specialization. Experiments on LRS2 and LRS3 show that
Omni-AVSR achieves comparable or superior accuracy to state-of-the-art
baselines while training a single model at substantially lower training and
deployment cost. The model also remains robust under acoustic noise,
and we analyze its scaling behavior as LLM size increases, providing insights
into the trade-off between performance and efficiency.
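To make the multi-granularity idea concrete, the following is a minimal, hypothetical sketch (not the paper's code) of matryoshka-style token compression: the same utterance is pooled at several rates, so a single model can be trained on short (cheap) and long (accurate) token streams and pick a granularity at inference time. The function name `compress` and the pooling scheme are illustrative assumptions.

```python
# Illustrative sketch only: matryoshka-style multi-granularity token
# compression via average pooling. All names here are hypothetical.

def compress(tokens, rate):
    """Average-pool each run of `rate` consecutive feature vectors,
    shortening the token sequence by roughly that factor."""
    out = []
    for i in range(0, len(tokens), rate):
        chunk = tokens[i:i + rate]
        dim = len(chunk[0])
        out.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return out

# One utterance viewed at several granularities: coarser rates yield
# fewer tokens (cheaper inference), finer rates keep more detail.
features = [[float(i)] * 4 for i in range(12)]   # 12 frames, feature dim 4
granularities = [1, 2, 4]                        # compression rates
views = {r: compress(features, r) for r in granularities}

for r, seq in views.items():
    print(f"rate {r}: {len(seq)} tokens")
```

In a matryoshka-style training loop, the loss would be computed on each view so the backbone learns to decode from any of the supported compression rates; fixed-rate methods, by contrast, commit to a single `rate` at training time.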