MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Abstract

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
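The routing idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: average pooling as the compression operator, the compression rates (1, 2, 4), the linear experts, and every name below (`avg_pool`, `MoELayer`, `route`, `forward`) are assumptions made for illustration only. The sketch shows a single router and a single set of experts being reused for token sequences at several granularities, with each token sent to its top-k routed experts plus an always-active shared expert.

```python
import math
import random


def avg_pool(tokens, rate):
    """Compress a sequence by averaging each group of `rate` consecutive vectors."""
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MoELayer:
    """Hypothetical sparse MoE block: top-k routed experts plus one shared expert."""

    def __init__(self, dim, num_routed, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand_matrix(num_routed, dim)  # one router for every scale
        self.routed = [rand_matrix(dim, dim) for _ in range(num_routed)]
        self.shared = rand_matrix(dim, dim)  # applied to every token

    def route(self, token):
        """Return (expert index, renormalized weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, token))
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = order[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Shared-expert output plus the weighted outputs of the selected experts."""
        out = matvec(self.shared, token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.routed[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


# Toy input: 12 tokens of dimension 8 standing in for audio-visual features.
rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(12)]
layer = MoELayer(dim=8, num_routed=4, top_k=2)

# The same layer (same router, same experts) serves every granularity.
outputs_by_rate = {}
for rate in (1, 2, 4):
    compressed = avg_pool(tokens, rate)
    outputs_by_rate[rate] = [layer.forward(t) for t in compressed]
```

In a setup like the one the abstract describes, the output of such a block would presumably be combined with the output of the frozen LLM layer it augments, and only the router and experts would be trained. How exactly that combination is done is not stated in the abstract, so the sketch leaves it out.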
