

USAD: Universal Speech and Audio Representation via Distillation

June 23, 2025
Authors: Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu
cs.AI

Abstract

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.
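To make the layer-to-layer distillation idea concrete, below is a minimal sketch of such an objective in PyTorch. It regresses projected student hidden states onto matched teacher-layer hidden states with an L1 plus cosine-similarity loss, a common choice in speech SSL distillation; the class name, the layer-pairing scheme, and the exact loss are illustrative assumptions, not the paper's verified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerToLayerDistillLoss(nn.Module):
    """Illustrative layer-to-layer distillation objective (assumed form).

    Each selected student layer is linearly projected and regressed onto
    the hidden states of a corresponding teacher layer. The teacher may be
    a domain-specific SSL model (e.g., a speech or general-audio encoder);
    its features serve as fixed targets.
    """

    def __init__(self, student_dim: int, teacher_dim: int, num_pairs: int):
        super().__init__()
        # One projection head per (student layer, teacher layer) pair.
        self.heads = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_pairs)
        )

    def forward(self, student_feats, teacher_feats):
        # student_feats / teacher_feats: lists of (batch, time, dim) tensors,
        # one entry per matched layer pair.
        loss = 0.0
        for head, s, t in zip(self.heads, student_feats, teacher_feats):
            pred = head(s)
            t = t.detach()  # no gradients into the frozen teacher targets
            l1 = F.l1_loss(pred, t)
            cos = F.cosine_similarity(pred, t, dim=-1).mean()
            loss = loss + l1 - cos  # minimize L1, maximize cosine similarity
        return loss / len(self.heads)
```

In a multi-teacher setup like the one the abstract describes, one such loss could be instantiated per teacher and summed, so a single student trained on mixed speech, sound, and music data absorbs supervision from each domain-specific model.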