MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
October 5, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Large language models (LLMs) have recently shown strong potential in
audio-visual speech recognition (AVSR), but their high computational demands
and sensitivity to token granularity limit their practicality in
resource-constrained settings. Token compression methods can reduce inference
cost, but they require fixing a compression rate in advance and produce a
single fixed-length output, offering no flexibility to balance information
density and efficiency at inference time. Matryoshka representation learning
(MRL) addresses this by enabling a single model to operate across multiple
token granularities, allowing compression rates to be adjusted dynamically.
However, current MRL-based methods treat each scale independently during
training, limiting cross-scale generalization, robustness at high compression,
and interpretability. To overcome these limitations, we propose MoME (Mixture
of Matryoshka Experts), a novel framework that integrates sparse
Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen
LLM with top-k routed and shared experts, allowing dynamic capacity allocation
across scales and modalities. A shared router promotes consistent expert
activation across granularities, enabling compressed sequences to benefit from
representations learned at lower compression. Experiments on LRS2 and LRS3
demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR,
and VSR tasks, while requiring significantly fewer parameters and maintaining
robustness under noise. MoME unifies the adaptability of MRL with the
efficiency of MoE, offering a scalable and interpretable solution for
resource-aware speech recognition.
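
The abstract describes MoME's mechanism only at a high level: a frozen LLM augmented with sparse top-k routed experts, an always-active shared expert, and a single router reused across Matryoshka token granularities. Below is a minimal sketch of such a block, not the authors' implementation; the expert sizes, the plain feed-forward experts, the average pooling used to produce coarser token scales, and the names MoMEBlock and Expert are illustrative assumptions.

# Minimal sketch of a MoME-style block (not the paper's code). Hidden sizes,
# expert structure, and the pooling used to emulate Matryoshka granularities
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)


class MoMEBlock(nn.Module):
    """Sparse MoE block with top-k routed experts plus a shared expert.

    The same router is applied at every token granularity, so heavily
    compressed sequences are routed consistently with longer ones.
    """
    def __init__(self, d_model=1024, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.shared_expert = Expert(d_model, d_hidden)
        self.router = nn.Linear(d_model, n_experts, bias=False)  # shared across scales
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)

        out = self.shared_expert(x)            # shared expert is always active
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)        # tokens routed to expert e
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out


if __name__ == "__main__":
    block = MoMEBlock()
    tokens = torch.randn(2, 128, 1024)          # full-resolution audio-visual tokens
    # Matryoshka-style granularities: the same block (and router) handles
    # progressively shorter, pooled versions of the sequence.
    for rate in (1, 2, 4):
        pooled = F.avg_pool1d(tokens.transpose(1, 2), rate).transpose(1, 2)
        print(rate, block(pooled).shape)

The dense per-expert loop above is deliberately simple; a practical implementation would dispatch only the tokens routed to each expert. The shared expert and shared router correspond to the abstract's claim that compressed sequences can reuse representations learned at lower compression.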