A Large-scale Dataset for Audio-Language Representation Learning

Abstract

The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. In audio representation learning, however, existing audio-language datasets suffer from insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline built on a series of public tools and APIs, and use it to construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.
