

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

October 2, 2024
Authors: Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha
cs.AI

Abstract

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audio. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because the generated data must not only be acoustically consistent with the underlying small-scale dataset but also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate that our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
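The abstract describes a two-stage recipe: prompt an LLM for diverse, class-consistent audio captions, feed those captions to a preference-aligned T2A diffusion model, and mix the generations into the small training set. The following is a minimal, hypothetical Python sketch of that control flow only; the `llm_generate_captions` and `t2a_generate` functions are placeholder stubs standing in for real model calls and do not correspond to any code released with the paper.

```python
"""
Illustrative sketch of a Synthio-style augmentation loop, not the authors' code.
`llm_generate_captions` and `t2a_generate` are hypothetical stubs standing in
for an instruction-tuned LLM and a preference-aligned text-to-audio diffusion
model; only the surrounding control flow is sketched.
"""
import random

import numpy as np


def llm_generate_captions(label: str, n: int) -> list[str]:
    # Stub: a real implementation would prompt an LLM for n diverse,
    # class-consistent audio captions and iteratively refine their quality.
    templates = [
        f"a recording of a {label} in a quiet room",
        f"a {label} heard faintly over background street noise",
        f"a close-up, high-quality recording of a {label}",
    ]
    return [random.choice(templates) for _ in range(n)]


def t2a_generate(caption: str, sample_rate: int = 16000) -> np.ndarray:
    # Stub: a real implementation would call a T2A diffusion model whose
    # generations have been aligned to the small dataset's acoustics via
    # preference optimization.
    return np.zeros(sample_rate, dtype=np.float32)  # 1 s placeholder waveform


def augment_dataset(dataset, samples_per_class: int = 50):
    """dataset: list of (waveform, label) pairs; returns original + synthetic."""
    labels = sorted({label for _, label in dataset})
    synthetic = []
    for label in labels:
        for caption in llm_generate_captions(label, samples_per_class):
            audio = t2a_generate(caption)
            # A real pipeline might also reject generations that drift too far
            # from the target class before adding them to the training set.
            synthetic.append((audio, label))
    return dataset + synthetic
```

In this sketch the augmented set is simply the concatenation of real and synthetic pairs; how many synthetic clips to add per class, and how aggressively to filter them, would be tuned on the downstream classifier.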
