

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

September 28, 2025
Authors: Kai Li, Kejun Gao, Xiaolin Hu
cs.AI

Abstract

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
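The abstract describes a global-local attention (GLA) block that captures multi-scale dependencies in the audio separator. The paper's actual block design is not given here, so the following is only a minimal illustrative sketch of the general global-local attention idea: one branch attends over the full sequence (coarse, global context) while a second branch restricts attention to a small window around each frame (fine, local structure), and the two are combined. All names, the window size, and the summation of branches are assumptions, not Dolphin's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(x, w_q, w_k, w_v, local_window=5):
    """Illustrative global-local attention sketch (NOT Dolphin's GLA block).

    x: (T, d) sequence of audio features.
    Global branch: scaled dot-product attention over all T frames.
    Local branch: the same attention, masked to a window around each frame.
    The branches are simply summed here; a real block would typically add
    projections, normalization, and residual connections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)

    # Global branch: every frame attends to the whole sequence.
    global_out = softmax(scores) @ v

    # Local branch: mask out positions beyond +/- local_window//2 frames.
    T = x.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= local_window // 2
    local_scores = np.where(mask, scores, -np.inf)
    local_out = softmax(local_scores) @ v

    return global_out + local_out

rng = np.random.default_rng(0)
T, d = 16, 8
x = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = global_local_attention(x, w_q, w_k, w_v)
print(y.shape)  # (16, 8): same shape as the input sequence
```

The efficiency claim in the abstract comes from keeping such blocks lightweight; a windowed local branch has effectively linear cost in sequence length, while the global branch can operate at a coarser scale.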