Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Abstract

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and incur high computational cost, which is unacceptable in many applications where speech separation serves only as a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets show that Dolphin not only surpasses the current state-of-the-art (SOTA) model in separation quality but also achieves remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
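As an illustration of what "discrete tokens" can mean here, the following is a minimal sketch of nearest-codebook quantization, a common way to turn continuous per-frame features into token indices. The abstract does not state how DP-LipCoder produces its tokens, so the technique, the function names (`squared_distance`, `quantize`), the 2-D toy features, and the three-entry codebook are all assumptions made for illustration and not the authors' implementation.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous frame feature to the index of its nearest code.

    Hypothetical helper: a generic vector-quantization lookup, not the
    DP-LipCoder encoder described in the abstract.
    """
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        # The token is the position of the closest codebook entry.
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy data (assumed): three codes and four per-frame feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 0.05]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Each frame is replaced by a single integer, so a downstream separator would consume a short index sequence and not dense visual features. In a trained system the codebook would be learned.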