A Large-scale Dataset for Audio-Language Representation Learning

Abstract

Large-scale multimodal datasets have driven significant progress in the AI community toward powerful foundation models. In audio representation learning, however, existing audio-language datasets are limited by insufficient volume, simplistic content, and arduous collection procedures. To address these challenges, we present an innovative, automatic audio caption generation pipeline built on a series of public tools or APIs, and use it to construct a large-scale, high-quality audio-language dataset named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on several downstream tasks: audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.
