Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

March 8, 2025
Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
cs.AI

Abstract

We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
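
The cascaded variant described in the abstract is a two-stage composition: speech is first mapped to Roman text, and that text is then converted into the target language's graphemes. A minimal Python sketch of that data flow follows. Every name in it (`CascadedPipeline`, `toy_romanizer`, `toy_deromanizer`, `TOY_LEXICON`) is hypothetical, and both stages are toy stand-ins for the trained AV-Romanizer and the LLM; it illustrates only the structure and is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces; neither name comes from the paper.
Romanizer = Callable[[bytes], str]        # audio-visual input -> Roman text
Deromanizer = Callable[[str, str], str]   # (Roman text, target language) -> graphemes


@dataclass
class CascadedPipeline:
    """Chains a romanizer and a de-romanizer, mirroring the cascaded design."""
    romanizer: Romanizer
    deromanizer: Deromanizer

    def transcribe(self, av_input: bytes, target_language: str) -> str:
        # Stage 1: language-agnostic prediction of Roman text.
        roman_text = self.romanizer(av_input)
        # Stage 2: conversion into language-specific graphemes.
        return self.deromanizer(roman_text, target_language)


def toy_romanizer(av_input: bytes) -> str:
    # Stand-in for a trained speech model: the toy "signal" already
    # contains the Roman text as ASCII bytes.
    return av_input.decode("ascii")


# Stand-in for the LLM: a tiny lookup table (assumption, for illustration only).
TOY_LEXICON = {
    ("privet", "Russian"): "привет",
    ("annyeong", "Korean"): "안녕",
}


def toy_deromanizer(roman_text: str, target_language: str) -> str:
    # Falls back to the Roman text when the toy lexicon has no entry.
    return TOY_LEXICON.get((roman_text, target_language), roman_text)


pipeline = CascadedPipeline(toy_romanizer, toy_deromanizer)
russian = pipeline.transcribe(b"privet", "Russian")
korean = pipeline.transcribe(b"annyeong", "Korean")
```

In this sketch the first stage never receives the target language, which reflects the language-agnostic role the abstract assigns to the AV-Romanizer; only the second stage depends on the language.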
