MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
October 5, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Pingchuan Ma, Honglie Chen, Xubo Liu, Stavros Petridis, Maja Pantic
cs.AI
Abstract
Large language models (LLMs) have recently shown strong potential in
audio-visual speech recognition (AVSR), but their high computational demands
and sensitivity to token granularity limit their practicality in
resource-constrained settings. Token compression methods can reduce inference
cost, but they require fixing a compression rate in advance and produce a
single fixed-length output, offering no flexibility to balance information
density and efficiency at inference time. Matryoshka representation learning
(MRL) addresses this by enabling a single model to operate across multiple
token granularities, allowing compression rates to be adjusted dynamically.
However, current MRL-based methods treat each scale independently during
training, limiting cross-scale generalization, robustness at high compression,
and interpretability. To overcome these limitations, we propose MoME (Mixture
of Matryoshka Experts), a novel framework that integrates sparse
Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen
LLM with top-k routed and shared experts, allowing dynamic capacity allocation
across scales and modalities. A shared router promotes consistent expert
activation across granularities, enabling compressed sequences to benefit from
representations learned at lower compression. Experiments on LRS2 and LRS3
demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR,
and VSR tasks, while requiring significantly fewer parameters and maintaining
robustness under noise. MoME unifies the adaptability of MRL with the
efficiency of MoE, offering a scalable and interpretable solution for
resource-aware speech recognition.
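
To make the described mechanism concrete, below is a minimal PyTorch sketch of a sparse MoE adapter with top-k routed experts plus an always-active shared expert, where a single router is reused for every token granularity. All names and hyper-parameters here (MoMEAdapter, Expert, num_experts, top_k, the bottleneck sizes) are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch (assumed names and sizes), not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small bottleneck FFN expert."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoMEAdapter(nn.Module):
    """Top-k routed experts plus an always-on shared expert.

    The router is a single linear gate reused for every token granularity,
    so sequences compressed at different Matryoshka rates are scored by the
    same routing weights.
    """

    def __init__(self, d_model: int = 1024, d_hidden: int = 256,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.shared_expert = Expert(d_model, d_hidden)
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len depends on the chosen
        # compression rate, but the router parameters do not.
        logits = self.router(x)                            # (B, T, E)
        weights, idx = logits.topk(self.top_k, dim=-1)     # (B, T, k)
        weights = F.softmax(weights, dim=-1)

        out = self.shared_expert(x)                        # shared path, always active
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                              # (B, T, k) bool
            if mask.any():
                # Per-token gate for expert e (0 where the expert is not selected).
                gate = (weights * mask).sum(dim=-1, keepdim=True)  # (B, T, 1)
                # Dense computation for clarity; a real implementation would
                # dispatch only the selected tokens to each expert.
                out = out + gate * expert(x)
        return out


if __name__ == "__main__":
    adapter = MoMEAdapter()
    tokens = torch.randn(2, 120, 1024)   # audio-visual tokens at one granularity
    fused = tokens + adapter(tokens)     # residual path around a frozen LLM block
    print(fused.shape)                   # torch.Size([2, 120, 1024])
```

In this sketch the adapter output is added residually to the hidden states of a frozen LLM block, so only the experts and the router would be trained; because the gate is shared across granularities, a heavily compressed sequence is routed by the same weights learned on longer, less compressed sequences, which is the cross-scale consistency the abstract highlights.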