

Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach

May 20, 2025
作者: Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti
cs.AI

Abstract

Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy environments by integrating visual cues. While recent advances integrate Large Language Models (LLMs) into AVSR, their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference costs. By incorporating sparsely-gated mixture-of-experts (MoE) projectors, Llama-SMoP enables the use of smaller LLMs while maintaining strong performance. We explore three SMoP configurations and show that Llama-SMoP DEDR (Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation studies confirm its effectiveness in expert activation, scalability, and noise robustness.
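The abstract's core idea is replacing a single dense audio/video-to-LLM projector with a sparsely-gated mixture of projectors: a router scores experts per token, only the top-k experts run, and in the DEDR configuration each modality gets its own disjoint router and expert set. The sketch below illustrates that mechanism in plain NumPy. It is an assumption-laden toy, not the authors' implementation: the class name `SparseMoPProjector`, the dimensions, and the expert count are all hypothetical, and real projectors would be trained modules rather than random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

class SparseMoPProjector:
    """Toy sparsely-gated mixture-of-projectors (hypothetical sketch).

    Each expert is a linear projection from the encoder dimension to the
    LLM embedding dimension; a router selects the top-k experts per token,
    so capacity grows with the expert count while per-token compute stays
    roughly constant. In a DEDR-style setup, audio and video would each
    get their own instance (disjoint experts AND disjoint routers).
    """

    def __init__(self, d_in, d_out, n_experts=4, top_k=2):
        self.top_k = top_k
        # One linear projector (weight matrix) per expert.
        self.experts = [rng.standard_normal((d_in, d_out)) * 0.02
                        for _ in range(n_experts)]
        # Router: maps each token to one score per expert.
        self.router = rng.standard_normal((d_in, n_experts)) * 0.02

    def __call__(self, x):            # x: (tokens, d_in)
        logits = x @ self.router      # (tokens, n_experts)
        # Keep only the top-k experts per token; mask the rest to -inf.
        kth = np.sort(logits, axis=-1)[:, -self.top_k][:, None]
        masked = np.where(logits >= kth, logits, -np.inf)
        # Softmax over the surviving logits -> sparse gate weights.
        gates = np.exp(masked - masked.max(axis=-1, keepdims=True))
        gates /= gates.sum(axis=-1, keepdims=True)
        # Weighted sum over the selected experts only.
        out = np.zeros((x.shape[0], self.experts[0].shape[1]))
        for e, W in enumerate(self.experts):
            sel = gates[:, e] > 0     # tokens routed to expert e
            if sel.any():
                out[sel] += gates[sel, e, None] * (x[sel] @ W)
        return out

# DEDR: disjoint experts and disjoint routers, one module per modality.
audio_proj = SparseMoPProjector(d_in=512, d_out=1024)
video_proj = SparseMoPProjector(d_in=512, d_out=1024)
audio_tokens = audio_proj(rng.standard_normal((10, 512)))
print(audio_tokens.shape)  # (10, 1024)
```

The design choice the paper contrasts (shared vs. disjoint routers/experts) maps here to whether `audio_proj` and `video_proj` share `self.router`, `self.experts`, both, or neither; DEDR shares nothing.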

