

MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

October 5, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
cs.AI

Abstract

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
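To make the architecture described above concrete, the following is a minimal sketch of the core idea: a sparse MoE block with top-k routed experts plus an always-active shared expert, where a single router is reused across every Matryoshka compression scale of the audio-visual token sequence. All names (MoMELayer, Expert, matryoshka_scales), hyperparameters, and the use of simple average pooling for compression are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch of a Matryoshka-style MoE adapter layer, loosely following the abstract:
# top-k routed experts + a shared expert, with one router reused at all scales.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoMELayer(nn.Module):
    """Sparse MoE block: a shared router (reused for every compression scale)
    selects top-k experts per token; a shared expert is always applied."""
    def __init__(self, d_model: int = 512, d_hidden: int = 1024,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)       # shared across scales
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.shared_expert = Expert(d_model, d_hidden)       # always active

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) at some compression scale
        logits = self.router(x)                              # (B, T, E)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (B, T, k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize gates

        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                           # (B, T, k) boolean
            if mask.any():
                gate = (topk_w * mask).sum(dim=-1, keepdim=True)  # (B, T, 1)
                routed = routed + gate * expert(x)
        return x + routed + self.shared_expert(x)            # residual update


def matryoshka_scales(tokens: torch.Tensor, rates=(1, 2, 4, 8)):
    """Average-pool the token sequence at several compression rates so the
    same layer (and router) sees every granularity during training."""
    outputs = []
    for r in rates:
        if r == 1:
            outputs.append(tokens)
        else:
            pooled = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=r, stride=r)
            outputs.append(pooled.transpose(1, 2))
    return outputs


if __name__ == "__main__":
    layer = MoMELayer()
    av_tokens = torch.randn(2, 64, 512)       # dummy fused audio-visual tokens
    for seq in matryoshka_scales(av_tokens):
        out = layer(seq)                       # same router/experts at every scale
        print(seq.shape, "->", out.shape)
```

Because the router parameters are shared across scales, expert assignments learned on lightly compressed sequences can transfer to heavily compressed ones, which is the consistency property the abstract attributes to the shared router.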
PDF · October 7, 2025