MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
October 5, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Large language models (LLMs) have recently shown strong potential in
audio-visual speech recognition (AVSR), but their high computational demands
and sensitivity to token granularity limit their practicality in
resource-constrained settings. Token compression methods can reduce inference
cost, but they require fixing a compression rate in advance and produce a
single fixed-length output, offering no flexibility to balance information
density and efficiency at inference time. Matryoshka representation learning
(MRL) addresses this by enabling a single model to operate across multiple
token granularities, allowing compression rates to be adjusted dynamically.
However, current MRL-based methods treat each scale independently during
training, limiting cross-scale generalization, robustness at high compression,
and interpretability. To overcome these limitations, we propose MoME (Mixture
of Matryoshka Experts), a novel framework that integrates sparse
Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen
LLM with top-k routed and shared experts, allowing dynamic capacity allocation
across scales and modalities. A shared router promotes consistent expert
activation across granularities, enabling compressed sequences to benefit from
representations learned at lower compression. Experiments on LRS2 and LRS3
demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR,
and VSR tasks, while requiring significantly fewer parameters and maintaining
robustness under noise. MoME unifies the adaptability of MRL with the
efficiency of MoE, offering a scalable and interpretable solution for
resource-aware speech recognition.
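
The abstract describes MoME's mechanism only at a high level: a frozen LLM augmented with sparse top-k routed experts, an always-active shared expert, and a single router reused across Matryoshka token granularities. Below is a minimal sketch of such a block, not the authors' implementation; the expert sizes, the plain feed-forward experts, the average pooling used to produce coarser token scales, and the names MoMEBlock and Expert are illustrative assumptions.

# Minimal sketch of a MoME-style block (not the paper's code). Hidden sizes,
# expert structure, and the pooling used to emulate Matryoshka granularities
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)


class MoMEBlock(nn.Module):
    """Sparse MoE block with top-k routed experts plus a shared expert.

    The same router is applied at every token granularity, so heavily
    compressed sequences are routed consistently with longer ones.
    """
    def __init__(self, d_model=1024, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.shared_expert = Expert(d_model, d_hidden)
        self.router = nn.Linear(d_model, n_experts, bias=False)  # shared across scales
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)

        out = self.shared_expert(x)            # shared expert is always active
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)        # tokens routed to expert e
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out


if __name__ == "__main__":
    block = MoMEBlock()
    tokens = torch.randn(2, 128, 1024)          # full-resolution audio-visual tokens
    # Matryoshka-style granularities: the same block (and router) handles
    # progressively shorter, pooled versions of the sequence.
    for rate in (1, 2, 4):
        pooled = F.avg_pool1d(tokens.transpose(1, 2), rate).transpose(1, 2)
        print(rate, block(pooled).shape)

The dense per-expert loop above is deliberately simple; a practical implementation would dispatch only the tokens routed to each expert. The shared expert and shared router correspond to the abstract's claim that compressed sequences can reuse representations learned at lower compression.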