MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition

Abstract

Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
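The combination of top-k routed experts, a shared expert, and one router reused across token granularities can be illustrated with a small sketch. The code below is not the authors' implementation: the class name `MoESketch`, the use of average pooling as the compression step, the linear experts, the renormalization of gate weights over the selected experts, and the residual connection are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def avg_pool(tokens, rate):
    """Compress a sequence by averaging each group of `rate` consecutive tokens.

    Average pooling is an assumed compression method, chosen for simplicity.
    """
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class MoESketch:
    """Hypothetical layer: top-k routed experts plus one always-on shared expert."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.top_k = top_k
        self.router = rand_matrix(num_experts, dim)  # one router for all granularities
        self.routed = [rand_matrix(dim, dim) for _ in range(num_experts)]
        self.shared = rand_matrix(dim, dim)

    def route(self, token):
        """Return (expert index, gate weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, token))
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:self.top_k]
        norm = sum(probs[i] for i in chosen)  # assumed: renormalize over the chosen experts
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Add shared and gated routed expert outputs to the token (assumed residual)."""
        out = matvec(self.shared, token)
        for idx, gate in self.route(token):
            expert_out = matvec(self.routed[idx], token)
            out = [o + gate * e for o, e in zip(out, expert_out)]
        return [t + o for t, o in zip(token, out)]


# The same layer (router and experts) processes every compression rate.
rng = random.Random(1)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(16)]
layer = MoESketch(dim=8, num_experts=4, top_k=2)
for rate in (1, 2, 4):
    outputs = [layer.forward(tok) for tok in avg_pool(tokens, rate)]
    print(rate, len(outputs))
```

Because the router weights are the same at every rate, a pooled token that resembles an uncompressed token is sent to the same experts, which is the intuition behind compressed sequences benefiting from representations learned at lower compression.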
