Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach
May 20, 2025
作者: Umberto Cappellazzo, Minsu Kim, Stavros Petridis, Daniele Falavigna, Alessio Brutti
cs.AI
Abstract
Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy
environments by integrating visual cues. While recent advances integrate Large
Language Models (LLMs) into AVSR, their high computational cost hinders
deployment in resource-constrained settings. To address this, we propose
Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of
Projectors (SMoP) module to scale model capacity without increasing inference
costs. By incorporating sparsely-gated mixture-of-experts (MoE) projectors,
Llama-SMoP enables the use of smaller LLMs while maintaining strong
performance. We explore three SMoP configurations and show that Llama-SMoP DEDR
(Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and
experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation
studies confirm its effectiveness in expert activation, scalability, and noise
robustness.
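To make the DEDR idea concrete, the sketch below shows a sparsely-gated top-k mixture-of-projectors with a separate router and separate experts per modality. This is a minimal NumPy illustration of the general technique, not the paper's implementation: the feature dimensions, expert count, top-k value, and variable names are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoEProjector:
    """Top-k sparsely-gated mixture of linear projectors (illustrative sketch)."""
    def __init__(self, d_in, d_out, num_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        # One router and a set of expert projection matrices per instance.
        self.router = rng.standard_normal((d_in, num_experts)) * 0.02
        self.experts = rng.standard_normal((num_experts, d_in, d_out)) * 0.02
        self.top_k = top_k

    def __call__(self, x):
        # x: (num_tokens, d_in) modality features.
        logits = x @ self.router                              # (num_tokens, E)
        topk = np.argsort(logits, axis=-1)[:, -self.top_k:]   # top-k expert ids
        out = np.zeros((x.shape[0], self.experts.shape[2]))
        for t in range(x.shape[0]):
            sel = topk[t]
            w = softmax(logits[t, sel])  # renormalize gates over selected experts
            for weight, e in zip(w, sel):
                # Only the selected experts run, so compute stays constant
                # as the total expert count (capacity) grows.
                out[t] += weight * (x[t] @ self.experts[e])
        return out

# "Disjoint-Experts, Disjoint-Routers": each modality gets its own
# router and expert pool (dimensions here are made up for the example).
audio_proj = SparseMoEProjector(d_in=80, d_out=512, seed=0)
video_proj = SparseMoEProjector(d_in=96, d_out=512, seed=1)

audio_tokens = audio_proj(np.random.default_rng(2).standard_normal((4, 80)))
video_tokens = video_proj(np.random.default_rng(3).standard_normal((4, 96)))
llm_inputs = np.concatenate([audio_tokens, video_tokens], axis=0)
```

Both modalities project into the same output dimension so the resulting token sequences can be fed to a shared LLM backbone; only `top_k` of the experts are active per token, which is what decouples capacity from inference cost.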