Natural Language Supervision for General-Purpose Audio Representations

Abstract

Audio-language models jointly learn multimodal text and audio representations that enable zero-shot inference. These models rely on their encoders to create powerful representations of the input and to generalize across tasks spanning sounds, music, and speech. Although such models have achieved remarkable performance, a gap remains between them and task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained on a diverse collection of 4.6M audio-text pairs and employs two innovative encoders for zero-shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks instead of the standard sound event classification training. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. The audio and language representations are then brought into a joint multimodal space using contrastive learning. Using these encoders improves downstream performance. We extensively evaluated the generalization of our representations on 26 downstream tasks, the largest such evaluation in the literature. Our model achieves state-of-the-art results on several tasks, leading the way towards general-purpose audio representations.
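To make the contrastive step and the zero-shot use of the joint space concrete, here is a minimal sketch in plain Python. It assumes the widely used symmetric cross-entropy (CLIP-style) formulation with a temperature of 0.07; the abstract does not specify the exact loss, so this is an illustration and not the authors' implementation. The function names and the toy embeddings, which stand in for real encoder outputs, are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(audio_emb, text_emb, temperature=0.07):
    """Temperature-scaled cosine similarity between every audio and text embedding."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of the audio-to-text and text-to-audio losses (assumed formulation)."""
    logits = similarity_matrix(audio_emb, text_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def zero_shot_classify(audio_vec, class_text_emb, class_names):
    """Pick the class whose text embedding is most similar to the audio embedding."""
    scores = similarity_matrix([audio_vec], class_text_emb, temperature=1.0)[0]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best]

# Toy 3-dimensional embeddings (hypothetical stand-ins for encoder outputs).
audio_batch = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
text_batch = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0]]

loss_matched = symmetric_contrastive_loss(audio_batch, text_batch)
loss_swapped = symmetric_contrastive_loss(audio_batch, text_batch[::-1])
label = zero_shot_classify([0.0, 0.9, 0.3], text_batch, ["dog bark", "piano"])
```

In this sketch, correctly paired audio and text embeddings give a lower loss than the same embeddings with the pairing swapped, which is the signal that pulls matching pairs together in the joint space. Zero-shot classification then needs no task-specific training: each class name is embedded as text and the closest one to the audio embedding is chosen.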
