A Large-scale Dataset for Audio-Language Representation Learning

Abstract

The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. In audio representation learning, however, existing audio-language datasets suffer from insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline built on a series of public tools or APIs, and use it to construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.