MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
October 5, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Large language models (LLMs) have recently shown strong potential in
audio-visual speech recognition (AVSR), but their high computational demands
and sensitivity to token granularity limit their practicality in
resource-constrained settings. Token compression methods can reduce inference
cost, but they require fixing a compression rate in advance and produce a
single fixed-length output, offering no flexibility to balance information
density and efficiency at inference time. Matryoshka representation learning
(MRL) addresses this by enabling a single model to operate across multiple
token granularities, allowing compression rates to be adjusted dynamically.
However, current MRL-based methods treat each scale independently during
training, limiting cross-scale generalization, robustness at high compression,
and interpretability. To overcome these limitations, we propose MoME (Mixture
of Matryoshka Experts), a novel framework that integrates sparse
Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen
LLM with top-k routed and shared experts, allowing dynamic capacity allocation
across scales and modalities. A shared router promotes consistent expert
activation across granularities, enabling compressed sequences to benefit from
representations learned at lower compression. Experiments on LRS2 and LRS3
demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR,
and VSR tasks, while requiring significantly fewer parameters and maintaining
robustness under noise. MoME unifies the adaptability of MRL with the
efficiency of MoE, offering a scalable and interpretable solution for
resource-aware speech recognition.
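
To make the described mechanism concrete, below is a minimal PyTorch sketch of a sparse MoE adapter with top-k routed experts plus an always-active shared expert, where a single router is reused for every token granularity. All names and hyper-parameters here (MoMEAdapter, Expert, num_experts, top_k, the bottleneck sizes) are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch (assumed names and sizes), not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small bottleneck FFN expert."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoMEAdapter(nn.Module):
    """Top-k routed experts plus an always-on shared expert.

    The router is a single linear gate reused for every token granularity,
    so sequences compressed at different Matryoshka rates are scored by the
    same routing weights.
    """

    def __init__(self, d_model: int = 1024, d_hidden: int = 256,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.shared_expert = Expert(d_model, d_hidden)
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len depends on the chosen
        # compression rate, but the router parameters do not.
        logits = self.router(x)                            # (B, T, E)
        weights, idx = logits.topk(self.top_k, dim=-1)     # (B, T, k)
        weights = F.softmax(weights, dim=-1)

        out = self.shared_expert(x)                        # shared path, always active
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                              # (B, T, k) bool
            if mask.any():
                # Per-token gate for expert e (0 where the expert is not selected).
                gate = (weights * mask).sum(dim=-1, keepdim=True)  # (B, T, 1)
                # Dense computation for clarity; a real implementation would
                # dispatch only the selected tokens to each expert.
                out = out + gate * expert(x)
        return out


if __name__ == "__main__":
    adapter = MoMEAdapter()
    tokens = torch.randn(2, 120, 1024)   # audio-visual tokens at one granularity
    fused = tokens + adapter(tokens)     # residual path around a frozen LLM block
    print(fused.shape)                   # torch.Size([2, 120, 1024])
```

In this sketch the adapter output is added residually to the hidden states of a frozen LLM block, so only the experts and the router would be trained; because the gate is shared across granularities, a heavily compressed sequence is routed by the same weights learned on longer, less compressed sequences, which is the cross-scale consistency the abstract highlights.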