USAD: Universal Speech and Audio Representation via Distillation
June 23, 2025
Authors: Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu
cs.AI
Abstract
Self-supervised learning (SSL) has revolutionized audio representations, yet
models often remain domain-specific, focusing on either speech or non-speech
tasks. In this work, we present Universal Speech and Audio Distillation (USAD),
a unified approach to audio representation learning that integrates diverse
audio types - speech, sound, and music - into a single model. USAD employs
efficient layer-to-layer distillation from domain-specific SSL models to train
a student on a comprehensive audio dataset. USAD offers competitive performance
across various benchmarks and datasets, including frame- and instance-level
speech processing tasks, audio tagging, and sound classification, achieving
near state-of-the-art results with a single encoder on SUPERB and HEAR
benchmarks.
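The core idea of layer-to-layer distillation, as described above, is to train each student layer to match the hidden representations of a corresponding teacher layer. The sketch below illustrates this with a simple per-layer mean-squared-error objective over numpy arrays; the function name, the choice of MSE, and the uniform layer averaging are assumptions for illustration, not the paper's exact formulation (which distills from multiple domain-specific teachers).

```python
import numpy as np

def layer_to_layer_distill_loss(student_layers, teacher_layers):
    """Hypothetical layer-to-layer distillation objective.

    Each student layer is matched against the teacher layer at the
    same depth; per-layer MSE losses are averaged. The actual USAD
    loss may differ (e.g. losses, projections, layer mapping).
    """
    assert len(student_layers) == len(teacher_layers)
    per_layer = [np.mean((s - t) ** 2)
                 for s, t in zip(student_layers, teacher_layers)]
    return float(np.mean(per_layer))

# Toy example: 4 layers of (time=10, dim=8) features.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(10, 8)) for _ in range(4)]
# A student that is close, but not equal, to the teacher.
student = [t + 0.1 * rng.normal(size=t.shape) for t in teacher]

loss = layer_to_layer_distill_loss(student, teacher)
```

In the multi-teacher setting the abstract describes, the same loss would be computed once per domain-specific teacher (e.g. a speech SSL model and a general-audio SSL model) and the terms summed, so that a single student encoder absorbs both domains.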