USAD: Universal Speech and Audio Representation via Distillation

Abstract

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame-level and instance-level speech processing tasks, audio tagging, and sound classification, and achieves near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.
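
To make the idea of layer-to-layer distillation from several domain-specific teachers more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the loss function, the layer pairing, or how student features are matched to each teacher. The sketch assumes a mean-squared-error loss, one linear projection head per teacher and layer pair, and a hypothetical `layer_map` of (student layer, teacher layer) index pairs. All function and variable names are illustrative.

```python
from typing import Dict, List, Tuple

Frame = List[float]        # one feature vector
Layer = List[Frame]        # frames x dims, the output of one layer
Matrix = List[List[float]]  # out_dim x in_dim projection weights


def project(frames: Layer, weight: Matrix) -> Layer:
    """Apply a linear map (out_dim x in_dim) to every frame."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weight]
        for frame in frames
    ]


def mse(a: Layer, b: Layer) -> float:
    """Mean squared error over all frames and dimensions."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def layer_to_layer_loss(
    student_layers: List[Layer],
    teachers: Dict[str, List[Layer]],
    heads: Dict[str, List[Matrix]],
    layer_map: List[Tuple[int, int]],
) -> float:
    """Average the per-pair losses over all teachers and mapped layer pairs.

    Assumption: each teacher gets its own projection head per mapped pair,
    and the student is compared with the teacher only at the mapped layers.
    """
    losses = []
    for name, teacher_layers in teachers.items():
        for pair_idx, (s_idx, t_idx) in enumerate(layer_map):
            predicted = project(student_layers[s_idx], heads[name][pair_idx])
            losses.append(mse(predicted, teacher_layers[t_idx]))
    return sum(losses) / len(losses)


# Toy usage: two student layers, two frames each, feature dimension 2.
student = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [2.0, 0.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = {"speech": [identity, identity], "audio": [identity, identity]}
layer_map = [(0, 0), (1, 1)]

# A "speech" teacher that matches the student, and an "audio" teacher
# whose features are shifted by 1.0 in every dimension.
speech_teacher = [[frame[:] for frame in layer] for layer in student]
audio_teacher = [[[x + 1.0 for x in frame] for frame in layer] for layer in student]

loss_same = layer_to_layer_loss(
    student, {"speech": speech_teacher}, heads, layer_map
)
loss_mixed = layer_to_layer_loss(
    student, {"speech": speech_teacher, "audio": audio_teacher}, heads, layer_map
)
```

In this toy setup the student matches the speech teacher exactly, so only the audio teacher contributes to the combined loss. In practice the projection heads and the student would be trained jointly by gradient descent on a loss of this general shape, with the teachers kept frozen.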
