Towards Diverse and Efficient Audio Captioning via Diffusion Models
September 14, 2024
Authors: Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Ruibo Fu, Wei Liang, Dong Yu
cs.AI
Abstract
We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive
diffusion model tailored for diverse and efficient audio captioning. Although
existing captioning models relying on language backbones have achieved
remarkable success in various captioning tasks, their limited generation
speed and diversity impede progress in audio
understanding and multimedia applications. Our diffusion-based framework offers
unique advantages stemming from its inherent stochasticity and holistic context
modeling in captioning. Through rigorous evaluation, we demonstrate that DAC
not only achieves state-of-the-art (SOTA) caption quality compared to
existing baselines, but also significantly outperforms them in terms of
generation speed and diversity. The success of DAC illustrates that text
generation can also be seamlessly integrated with audio and visual generation
tasks using a diffusion backbone, paving the way for a unified, audio-related
generative model across different modalities.Summary
AI-Generated Summary
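The abstract contrasts autoregressive decoding with a diffusion model's parallel, stochastic refinement of the whole caption at once. A minimal sketch of that non-autoregressive sampling pattern is shown below; the denoiser, the update rule, and all names are illustrative assumptions standing in for DAC's trained, audio-conditioned network, not the authors' actual method:

```python
import numpy as np

# Toy stand-in for a learned noise-prediction network conditioned on
# audio features (hypothetical; DAC would use a trained model here).
def toy_denoiser(x_t, t, audio_cond):
    # Predicts a direction pulling the noisy embeddings toward the
    # audio-conditioned target.
    return x_t - audio_cond

def diffusion_sample(audio_cond, seq_len=8, emb_dim=16, steps=10, seed=0):
    """Non-autoregressive sampling: every token position in the caption
    embedding sequence is refined in parallel at each denoising step,
    starting from pure Gaussian noise (this is the source of the
    stochasticity and diversity the abstract highlights)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq_len, emb_dim))   # start from noise
    for t in range(steps, 0, -1):
        eps_hat = toy_denoiser(x, t, audio_cond)
        x = x - (1.0 / steps) * eps_hat           # simple Euler-style update
    return x

# Hypothetical audio conditioning vector per token position.
rng = np.random.default_rng(1)
audio_cond = rng.standard_normal((8, 16))
emb = diffusion_sample(audio_cond)
# A real system would then map each embedding row back to a token,
# e.g. by nearest neighbour against a vocabulary embedding table.
print(emb.shape)  # (8, 16)
```

Because the whole sequence is updated at each step rather than token by token, the number of network calls is the (fixed) step count instead of the caption length, which is where the claimed speed advantage over autoregressive language backbones comes from.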