

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

September 28, 2025
Authors: Kai Li, Kejun Gao, Xiaolin Hu
cs.AI

Abstract

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
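The abstract describes a global-local attention (GLA) block that captures multi-scale dependencies in the audio separator. The paper's actual block design is not given here, so the following is only a minimal illustrative sketch of the general global-local attention idea: one branch attends over the full sequence (coarse, global context) while a second branch restricts attention to a small window around each frame (fine, local structure), and the two are combined. All names, the window size, and the summation of branches are assumptions, not Dolphin's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(x, w_q, w_k, w_v, local_window=5):
    """Illustrative global-local attention sketch (NOT Dolphin's GLA block).

    x: (T, d) sequence of audio features.
    Global branch: scaled dot-product attention over all T frames.
    Local branch: the same attention, masked to a window around each frame.
    The branches are simply summed here; a real block would typically add
    projections, normalization, and residual connections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)

    # Global branch: every frame attends to the whole sequence.
    global_out = softmax(scores) @ v

    # Local branch: mask out positions beyond +/- local_window//2 frames.
    T = x.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= local_window // 2
    local_scores = np.where(mask, scores, -np.inf)
    local_out = softmax(local_scores) @ v

    return global_out + local_out

rng = np.random.default_rng(0)
T, d = 16, 8
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = global_local_attention(x, w_q, w_k, w_v)
print(y.shape)  # (16, 8): same shape as the input sequence
```

The efficiency claim in the abstract comes from keeping such blocks lightweight; a windowed local branch has effectively linear cost in sequence length, while the global branch can operate at a coarser scale.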