USAD: Universal Speech and Audio Representation via Distillation
June 23, 2025
Authors: Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu
cs.AI
Abstract
Self-supervised learning (SSL) has revolutionized audio representations, yet
models often remain domain-specific, focusing on either speech or non-speech
tasks. In this work, we present Universal Speech and Audio Distillation (USAD),
a unified approach to audio representation learning that integrates diverse
audio types - speech, sound, and music - into a single model. USAD employs
efficient layer-to-layer distillation from domain-specific SSL models to train
a student on a comprehensive audio dataset. USAD offers competitive performance
across various benchmarks and datasets, including frame- and instance-level
speech processing tasks, audio tagging, and sound classification, achieving
near state-of-the-art results with a single encoder on SUPERB and HEAR
benchmarks.
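The core idea of layer-to-layer distillation, as described above, is to train each student layer to match the hidden representations of a corresponding teacher layer. The sketch below illustrates this with a simple per-layer mean-squared-error objective over numpy arrays; the function name, the choice of MSE, and the uniform layer averaging are assumptions for illustration, not the paper's exact formulation (which distills from multiple domain-specific teachers).

```python
import numpy as np

def layer_to_layer_distill_loss(student_layers, teacher_layers):
    """Hypothetical layer-to-layer distillation objective.

    Each student layer is matched against the teacher layer at the
    same depth; per-layer MSE losses are averaged. The actual USAD
    loss may differ (e.g. losses, projections, layer mapping).
    """
    assert len(student_layers) == len(teacher_layers)
    per_layer = [np.mean((s - t) ** 2)
                 for s, t in zip(student_layers, teacher_layers)]
    return float(np.mean(per_layer))

# Toy example: 4 layers of (time=10, dim=8) features.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(10, 8)) for _ in range(4)]
# A student that is close, but not equal, to the teacher.
student = [t + 0.1 * rng.normal(size=t.shape) for t in teacher]

loss = layer_to_layer_distill_loss(student, teacher)
```

In the multi-teacher setting the abstract describes, the same loss would be computed once per domain-specific teacher (e.g. a speech SSL model and a general-audio SSL model) and the terms summed, so that a single student encoder absorbs both domains.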