Towards Diverse and Efficient Audio Captioning via Diffusion Models
September 14, 2024
Authors: Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Ruibo Fu, Wei Liang, Dong Yu
cs.AI
Abstract
We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive
diffusion model tailored for diverse and efficient audio captioning. Although
existing captioning models relying on language backbones have achieved
remarkable success in various captioning tasks, their limited generation
speed and diversity impede progress in audio
understanding and multimedia applications. Our diffusion-based framework offers
unique advantages stemming from its inherent stochasticity and holistic context
modeling in captioning. Through rigorous evaluation, we demonstrate that DAC
not only achieves state-of-the-art (SOTA) caption quality compared to
existing baselines, but also significantly outperforms them in terms of
generation speed and diversity. The success of DAC illustrates that text
generation can also be seamlessly integrated with audio and visual generation
tasks using a diffusion backbone, paving the way for a unified, audio-related
generative model across different modalities.Summary
AI-Generated Summary
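The abstract contrasts autoregressive decoding with a diffusion model's parallel, stochastic refinement of the whole caption at once. A minimal sketch of that non-autoregressive sampling pattern is shown below; the denoiser, the update rule, and all names are illustrative assumptions standing in for DAC's trained, audio-conditioned network, not the authors' actual method:

```python
import numpy as np

# Toy stand-in for a learned noise-prediction network conditioned on
# audio features (hypothetical; DAC would use a trained model here).
def toy_denoiser(x_t, t, audio_cond):
    # Predicts a direction pulling the noisy embeddings toward the
    # audio-conditioned target.
    return x_t - audio_cond

def diffusion_sample(audio_cond, seq_len=8, emb_dim=16, steps=10, seed=0):
    """Non-autoregressive sampling: every token position in the caption
    embedding sequence is refined in parallel at each denoising step,
    starting from pure Gaussian noise (this is the source of the
    stochasticity and diversity the abstract highlights)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq_len, emb_dim))   # start from noise
    for t in range(steps, 0, -1):
        eps_hat = toy_denoiser(x, t, audio_cond)
        x = x - (1.0 / steps) * eps_hat           # simple Euler-style update
    return x

# Hypothetical audio conditioning vector per token position.
rng = np.random.default_rng(1)
audio_cond = rng.standard_normal((8, 16))
emb = diffusion_sample(audio_cond)
# A real system would then map each embedding row back to a token,
# e.g. by nearest neighbour against a vocabulary embedding table.
print(emb.shape)  # (8, 16)
```

Because the whole sequence is updated at each step rather than token by token, the number of network calls is the (fixed) step count instead of the caption length, which is where the claimed speed advantage over autoregressive language backbones comes from.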