Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
September 28, 2025
Authors: Kai Li, Kejun Gao, Xiaolin Hu
cs.AI
Abstract
Audio-visual speech separation (AVSS) methods leverage visual cues to extract
target speech and have demonstrated strong separation quality in noisy acoustic
environments. However, these methods usually involve a large number of
parameters and incur high computational cost, which is unacceptable in many
applications where speech separation serves only as a preprocessing step for
further speech processing. To address this issue, we propose an efficient AVSS
method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a
dual-path lightweight video encoder that transforms lip motion into discrete
audio-aligned semantic tokens. For audio separation, we construct a lightweight
encoder-decoder separator, in which each layer incorporates a global-local
attention (GLA) block to efficiently capture multi-scale dependencies.
Experiments on three benchmark datasets showed that Dolphin not only surpassed
the current state-of-the-art (SOTA) model in separation quality but also
achieved remarkable improvements in efficiency: over 50% fewer parameters, a
more than 2.4x reduction in MACs, and over 6x faster GPU inference. These
results indicate that Dolphin offers a practical and deployable solution for
high-performance AVSS in real-world scenarios. Our code and demo page are
publicly available at http://cslikai.cn/Dolphin/.
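
To make the abstract's components concrete: the phrase "discrete audio-aligned
semantic tokens" suggests that DP-LipCoder quantizes continuous lip features
against a learned codebook. The PyTorch sketch below assumes a VQ-VAE-style
nearest-neighbor lookup with a straight-through gradient; the class name,
codebook size, and feature dimension are illustrative, and the audio-alignment
supervision (e.g., distillation from an audio encoder) is omitted. This is a
reading of the abstract, not the authors' implementation.

    import torch
    import torch.nn as nn

    class LipTokenQuantizer(nn.Module):
        # Hypothetical quantizer: maps continuous lip-motion features to
        # discrete codebook indices ("semantic tokens"). All sizes made up.
        def __init__(self, num_codes: int = 512, dim: int = 256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, feats: torch.Tensor):
            # feats: (batch, frames, dim) continuous lip features
            # Squared Euclidean distance from each frame to every code.
            dists = (feats.pow(2).sum(-1, keepdim=True)
                     - 2 * feats @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(-1))
            tokens = dists.argmin(dim=-1)         # (batch, frames) token ids
            quantized = self.codebook(tokens)     # nearest code vectors
            # Straight-through estimator: gradients bypass the argmin.
            quantized = feats + (quantized - feats).detach()
            return tokens, quantized

For example, LipTokenQuantizer()(torch.randn(2, 50, 256)) yields token ids of
shape (2, 50) and quantized features of shape (2, 50, 256).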
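
Similarly, the global-local attention (GLA) block is described only at a high
level. One plausible reading, sketched below, pairs a global branch that
attends over a pooled (coarse) copy of the sequence with a local branch built
from a depthwise convolution, so attention cost scales with (T/pool)^2 rather
than T^2. All names and hyperparameters here are hypothetical, not taken from
the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalLocalAttention(nn.Module):
        # Hypothetical GLA block: coarse global attention + fine local conv.
        def __init__(self, dim: int = 128, num_heads: int = 4,
                     pool: int = 4, kernel: int = 5):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.local = nn.Conv1d(dim, dim, kernel, padding=kernel // 2,
                                   groups=dim)   # depthwise: cheap local context
            self.proj = nn.Linear(2 * dim, dim)
            self.pool = pool

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, dim)
            y = self.norm(x)
            # Global branch: attend over a pooled sequence, then upsample.
            g = F.avg_pool1d(y.transpose(1, 2), self.pool, stride=self.pool)
            g, _ = self.attn(g.transpose(1, 2), g.transpose(1, 2),
                             g.transpose(1, 2), need_weights=False)
            g = F.interpolate(g.transpose(1, 2), size=x.size(1)).transpose(1, 2)
            # Local branch: depthwise conv for frame-level detail.
            loc = self.local(y.transpose(1, 2)).transpose(1, 2)
            return x + self.proj(torch.cat([g, loc], dim=-1))  # residual fusion

Fusing the two branches in every layer is one way to capture multi-scale
dependencies without full-length self-attention, which matches the efficiency
emphasis in the abstract.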