Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
March 9, 2025
Authors: Umberto Cappellazzo, Minsu Kim, Stavros Petridis
cs.AI
Abstract
Audio-Visual Speech Recognition (AVSR) leverages both audio and visual
modalities to enhance speech recognition robustness, particularly in noisy
environments. Recent advancements in Large Language Models (LLMs) have
demonstrated their effectiveness in speech recognition, including AVSR.
However, due to the significant length of speech representations, direct
integration with LLMs imposes substantial computational costs. Prior approaches
address this by compressing speech representations before feeding them into
LLMs. However, higher compression ratios often lead to performance degradation,
necessitating a trade-off between computational efficiency and recognition
accuracy. To address this challenge, we propose Llama-MTSK, the first
Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of
the audio-visual token allocation based on specific computational constraints
while preserving high performance. Our approach, inspired by Matryoshka
Representation Learning, encodes audio-visual representations at multiple
granularities within a single model, eliminating the need to train separate
models for different compression levels. Moreover, to efficiently fine-tune the
LLM, we introduce three LoRA-based Matryoshka strategies using global and
scale-specific LoRA modules. Extensive evaluations on the two largest AVSR
datasets demonstrate that Llama-MTSK achieves state-of-the-art results,
matching or surpassing models trained independently at fixed compression
levels.
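The abstract names two mechanisms: multi-granularity (Matryoshka-style) compression of audio-visual tokens within one model, and LoRA fine-tuning with both global and scale-specific adapters. Below is a minimal PyTorch sketch of how such a setup could be wired together. All names (MatryoshkaAVCompressor, GlobalAndScaleLoRA), the average-pooling compressor, the pooling rates, and the dimensions are illustrative assumptions, not the paper's actual implementation; the abstract does not specify the details of its three LoRA Matryoshka strategies.

```python
import torch
import torch.nn as nn

class MatryoshkaAVCompressor(nn.Module):
    """Sketch: compress an audio-visual token sequence at several rates with
    shared weights, so a single model serves every compression level."""

    def __init__(self, dim: int, rates=(2, 4, 8)):
        super().__init__()
        self.rates = rates
        self.proj = nn.Linear(dim, dim)  # shared projector across all scales

    def compress(self, tokens: torch.Tensor, rate: int) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); average-pool windows of `rate` tokens.
        b, t, d = tokens.shape
        t_trim = (t // rate) * rate  # drop the ragged tail for simplicity
        pooled = tokens[:, :t_trim].reshape(b, t_trim // rate, rate, d).mean(dim=2)
        return self.proj(pooled)

    def forward(self, tokens: torch.Tensor) -> dict:
        # Training: emit every granularity so one pass can supervise all scales.
        return {rate: self.compress(tokens, rate) for rate in self.rates}


class GlobalAndScaleLoRA(nn.Module):
    """Sketch: one global low-rank adapter shared by all scales plus one
    adapter per compression rate; only the chosen rate's adapter fires."""

    def __init__(self, dim: int, rank: int = 8, rates=(2, 4, 8)):
        super().__init__()
        self.global_down = nn.Linear(dim, rank, bias=False)
        self.global_up = nn.Linear(rank, dim, bias=False)
        self.scale_down = nn.ModuleDict(
            {str(r): nn.Linear(dim, rank, bias=False) for r in rates})
        self.scale_up = nn.ModuleDict(
            {str(r): nn.Linear(rank, dim, bias=False) for r in rates})

    def forward(self, x: torch.Tensor, rate: int) -> torch.Tensor:
        # Returns only the low-rank update; the frozen base layer's output
        # would be added outside this module.
        delta = self.global_up(self.global_down(x))
        delta = delta + self.scale_up[str(rate)](self.scale_down[str(rate)](x))
        return delta


if __name__ == "__main__":
    av_tokens = torch.randn(1, 96, 256)  # fused audio-visual features (toy)
    compressor = MatryoshkaAVCompressor(dim=256)
    lora = GlobalAndScaleLoRA(dim=256)
    for rate, seq in compressor(av_tokens).items():
        print(rate, seq.shape, lora(seq, rate).shape)
```

At inference, one would pick a single rate to match the available compute budget (fewer tokens for tighter budgets) and route through the corresponding scale-specific adapter, which is the flexibility the abstract attributes to training all granularities jointly in one model.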