Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
September 28, 2025
Authors: Kai Li, Kejun Gao, Xiaolin Hu
cs.AI
Abstract
Audio-visual speech separation (AVSS) methods leverage visual cues to extract
target speech and have demonstrated strong separation quality in noisy acoustic
environments. However, these methods usually involve a large number of
parameters and incur high computational cost, which is unacceptable in many
applications where speech separation serves only as a preprocessing step for
further speech processing. To address this issue, we propose an efficient AVSS
method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a
dual-path lightweight video encoder that transforms lip motion into discrete
audio-aligned semantic tokens. For audio separation, we construct a lightweight
encoder-decoder separator, in which each layer incorporates a global-local
attention (GLA) block to efficiently capture multi-scale dependencies.
Experiments on three benchmark datasets showed that Dolphin not only surpassed
the current state-of-the-art (SOTA) model in separation quality but also
achieved remarkable improvements in efficiency: over 50% fewer parameters, a
more than 2.4x reduction in MACs, and over 6x faster GPU inference. These
results indicate that Dolphin offers a practical and deployable solution for
high-performance AVSS in real-world scenarios. Our code and demo page are
publicly available at http://cslikai.cn/Dolphin/.
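
To make the abstract's components concrete: the phrase "discrete audio-aligned
semantic tokens" suggests that DP-LipCoder quantizes continuous lip features
against a learned codebook. The PyTorch sketch below assumes a VQ-VAE-style
nearest-neighbor lookup with a straight-through gradient; the class name,
codebook size, and feature dimension are illustrative, and the audio-alignment
supervision (e.g., distillation from an audio encoder) is omitted. This is a
reading of the abstract, not the authors' implementation.

    import torch
    import torch.nn as nn

    class LipTokenQuantizer(nn.Module):
        # Hypothetical quantizer: maps continuous lip-motion features to
        # discrete codebook indices ("semantic tokens"). All sizes made up.
        def __init__(self, num_codes: int = 512, dim: int = 256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, feats: torch.Tensor):
            # feats: (batch, frames, dim) continuous lip features
            # Squared Euclidean distance from each frame to every code.
            dists = (feats.pow(2).sum(-1, keepdim=True)
                     - 2 * feats @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(-1))
            tokens = dists.argmin(dim=-1)         # (batch, frames) token ids
            quantized = self.codebook(tokens)     # nearest code vectors
            # Straight-through estimator: gradients bypass the argmin.
            quantized = feats + (quantized - feats).detach()
            return tokens, quantized

For example, LipTokenQuantizer()(torch.randn(2, 50, 256)) yields token ids of
shape (2, 50) and quantized features of shape (2, 50, 256).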
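
Similarly, the global-local attention (GLA) block is described only at a high
level. One plausible reading, sketched below, pairs a global branch that
attends over a pooled (coarse) copy of the sequence with a local branch built
from a depthwise convolution, so attention cost scales with (T/pool)^2 rather
than T^2. All names and hyperparameters here are hypothetical, not taken from
the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalLocalAttention(nn.Module):
        # Hypothetical GLA block: coarse global attention + fine local conv.
        def __init__(self, dim: int = 128, num_heads: int = 4,
                     pool: int = 4, kernel: int = 5):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.local = nn.Conv1d(dim, dim, kernel, padding=kernel // 2,
                                   groups=dim)   # depthwise: cheap local context
            self.proj = nn.Linear(2 * dim, dim)
            self.pool = pool

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, dim)
            y = self.norm(x)
            # Global branch: attend over a pooled sequence, then upsample.
            g = F.avg_pool1d(y.transpose(1, 2), self.pool, stride=self.pool)
            g, _ = self.attn(g.transpose(1, 2), g.transpose(1, 2),
                             g.transpose(1, 2), need_weights=False)
            g = F.interpolate(g.transpose(1, 2), size=x.size(1)).transpose(1, 2)
            # Local branch: depthwise conv for frame-level detail.
            loc = self.local(y.transpose(1, 2)).transpose(1, 2)
            return x + self.proj(torch.cat([g, loc], dim=-1))  # residual fusion

Fusing the two branches in every layer is one way to capture multi-scale
dependencies without full-length self-attention, which matches the efficiency
emphasis in the abstract.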