Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach

Abstract

Audio-Visual Speech Recognition (AVSR) improves robustness in noisy environments by integrating visual cues. Recent work integrates Large Language Models (LLMs) into AVSR, but their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference cost. By incorporating sparsely gated mixture-of-experts (MoE) projectors, Llama-SMoP enables the use of smaller LLMs while maintaining strong performance. We explore three SMoP configurations and show that Llama-SMoP DEDR (Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation studies confirm its effectiveness in terms of expert activation, scalability, and noise robustness.
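To make the DEDR idea concrete, the following is a minimal sketch of a sparsely gated projector with one independent router and expert pool per modality. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the routing rule, the expert architecture, or any dimensions, so the softmax top-k routing, the single-linear-layer experts, the class name SparseProjector, and all sizes below are hypothetical choices.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SparseProjector:
    """Hypothetical router plus expert pool; only the top-k experts run per token."""

    def __init__(self, d_in, d_out, num_experts, top_k, rng):
        self.top_k = top_k
        # Router: one score row per expert (assumed to be linear).
        self.router = [[rng.gauss(0, 0.1) for _ in range(d_in)]
                       for _ in range(num_experts)]
        # Experts: assumed to be single linear maps to keep the sketch short.
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(d_in)]
                         for _ in range(d_out)]
                        for _ in range(num_experts)]

    def route(self, token):
        """Return [(expert_index, weight)] for the top-k experts, weights summing to 1."""
        probs = softmax(matvec(self.router, token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Project one token using only the selected experts."""
        out = [0.0] * len(self.experts[0])
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


rng = random.Random(0)
# Disjoint experts, disjoint routers: one independent projector per modality.
projectors = {
    "audio": SparseProjector(d_in=8, d_out=16, num_experts=4, top_k=2, rng=rng),
    "video": SparseProjector(d_in=8, d_out=16, num_experts=4, top_k=2, rng=rng),
}

audio_token = [rng.gauss(0, 1) for _ in range(8)]
video_token = [rng.gauss(0, 1) for _ in range(8)]
audio_embedding = projectors["audio"].forward(audio_token)
video_embedding = projectors["video"].forward(video_token)
```

In this sketch, raising num_experts adds parameters, while the number of expert evaluations per token stays fixed at top_k. Only the small router grows with the pool, which is the sense in which a sparse projector can add capacity at roughly constant inference cost.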
