A Large-scale Dataset for Audio-Language Representation Learning

Abstract

The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in audio representation learning, existing audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline based on a series of public tools or APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.
