FoleyGen: Visually-Guided Audio Generation

September 19, 2023
Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra
cs.AI

Abstract

Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. The experimental results on the VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and human evaluations.
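The pipeline described in the abstract, a neural audio codec that tokenizes waveforms plus a single Transformer that predicts those tokens while attending to visual-encoder features, can be sketched in a few lines of PyTorch. The sketch below is illustrative only: it assumes a single codebook stream, uses plain cross-attention as a stand-in for the paper's three visual attention mechanisms, and every class name, size, and shape in it is a hypothetical placeholder rather than the authors' code.

```python
import torch
import torch.nn as nn

class VisuallyConditionedDecoder(nn.Module):
    """Illustrative decoder-only Transformer that autoregressively predicts
    discrete audio-codec tokens while cross-attending to per-frame visual
    features. A hypothetical sketch, not the FoleyGen implementation."""

    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, T_audio) integer codes from a neural audio codec
        # visual_feats: (B, T_video, d_model) features from a visual encoder
        x = self.token_emb(audio_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(
            x.size(1)).to(x.device)   # enforce left-to-right generation
        h = self.decoder(tgt=x, memory=visual_feats, tgt_mask=causal)
        return self.lm_head(h)        # next-token logits over the codec vocabulary


# Toy usage: 2 clips, 50 audio tokens each, 30 video frames of 512-d features.
model = VisuallyConditionedDecoder()
tokens = torch.randint(0, 1024, (2, 50))
frames = torch.randn(2, 30, 512)
logits = model(tokens, frames)        # shape (2, 50, 1024)
```

At inference, such a decoder would be sampled token by token and the resulting code sequence passed back through the codec's decoder to recover a waveform; the paper's contribution lies in how the cross-modal attention is structured for temporal alignment, which this plain cross-attention sketch does not capture.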