
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

September 28, 2025
Authors: Kai Li, Kejun Gao, Xiaolin Hu
cs.AI

Abstract

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and incur high computational cost, which is unacceptable in many applications where speech separation serves only as a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable gains in efficiency: over 50% fewer parameters, more than a 2.4x reduction in MACs, and over 6x faster GPU inference. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
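
The abstract describes the GLA block only at a high level. As a rough illustration of what "multi-scale global-local attention" typically denotes, the sketch below pairs a global branch (multi-head self-attention over a temporally downsampled sequence) with a local branch (depthwise convolution at full resolution). The module name `GlobalLocalAttention`, the dimensions, the pooling factor, and the fusion layer are all assumptions made for this example, not Dolphin's published implementation.

```python
# Minimal sketch only; NOT the Dolphin implementation. All module names,
# dimensions, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    """One plausible global-local attention (GLA) block: a global branch runs
    multi-head self-attention on a temporally downsampled sequence (cheap
    long-range context), a local branch runs a depthwise 1-D convolution at
    full resolution (fine-grained detail), and the two branches are fused and
    added back to the input."""

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 kernel_size: int = 5, pool: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(pool, stride=pool)   # downsample for the global branch
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local = nn.Conv1d(d_model, d_model, kernel_size,
                               padding=kernel_size // 2, groups=d_model)  # depthwise conv
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) audio feature sequence
        t = x.size(1)
        # Global branch: attention over a coarser time axis, then upsample back.
        g = self.pool(x.transpose(1, 2)).transpose(1, 2)
        g, _ = self.attn(g, g, g)
        g = F.interpolate(g.transpose(1, 2), size=t, mode="nearest").transpose(1, 2)
        # Local branch: depthwise convolution at full temporal resolution.
        l = self.local(x.transpose(1, 2)).transpose(1, 2)
        # Fuse both scales and apply a residual connection.
        return self.norm(x + self.fuse(torch.cat([g, l], dim=-1)))


if __name__ == "__main__":
    feats = torch.randn(2, 200, 256)                  # dummy audio features
    print(GlobalLocalAttention()(feats).shape)        # torch.Size([2, 200, 256])
```

Attending over a pooled sequence keeps the quadratic attention cost low while the depthwise convolution preserves local detail, which is consistent with the efficiency gains (fewer parameters, lower MACs, faster inference) emphasized in the abstract.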