FoleyGen: Visually-Guided Audio Generation
September 19, 2023
Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra
cs.AI
Abstract
Recent advancements in audio generation have been spurred by the evolution of
large-scale deep learning models and expansive datasets. However, the task of
video-to-audio (V2A) generation continues to be a challenge, principally
because of the intricate relationship between the high-dimensional visual and
auditory data, and the challenges associated with temporal synchronization. In
this study, we introduce FoleyGen, an open-domain V2A generation system built
on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural
audio codec for bidirectional conversion between waveforms and discrete tokens.
The generation of audio tokens is facilitated by a single Transformer model,
which is conditioned on visual features extracted from a visual encoder. A
prevalent problem in V2A generation is the misalignment of generated audio with
the visible actions in the video. To address this, we explore three novel
visual attention mechanisms. We further undertake an exhaustive evaluation of
multiple visual encoders, each pretrained on either single-modal or multi-modal
tasks. The experimental results on the VGGSound dataset show that our proposed
FoleyGen outperforms previous systems across all objective metrics and human
evaluations.
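
The abstract describes a language-modeling pipeline: a neural codec turns waveforms into discrete tokens, and a Transformer generates those tokens autoregressively while attending to visual features. The following is a minimal conceptual sketch of that setup, not the paper's implementation: the codec tokenizer and visual encoder are omitted, and names such as `V2ATransformer`, `VISUAL_DIM`, and `VOCAB_SIZE` are illustrative placeholders.

```python
# Minimal sketch of a visually-conditioned audio-token language model,
# assuming an EnCodec-style tokenizer and a frozen visual encoder
# (both replaced here by random stand-in tensors).
import torch
import torch.nn as nn

VISUAL_DIM = 768   # dimensionality of per-frame visual features (assumed)
VOCAB_SIZE = 1024  # size of the discrete audio-token codebook (assumed)
MODEL_DIM = 512


class V2ATransformer(nn.Module):
    """Autoregressive audio-token model that cross-attends to visual features."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, MODEL_DIM)
        self.visual_proj = nn.Linear(VISUAL_DIM, MODEL_DIM)
        layer = nn.TransformerDecoderLayer(MODEL_DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.lm_head = nn.Linear(MODEL_DIM, VOCAB_SIZE)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, T_a) codec token ids; visual_feats: (B, T_v, VISUAL_DIM)
        x = self.token_emb(audio_tokens)
        memory = self.visual_proj(visual_feats)
        # Causal mask so each position only attends to earlier audio tokens.
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)  # next-token logits over the audio codebook


# Toy usage with random stand-ins for codec tokens and frame features.
tokens = torch.randint(0, VOCAB_SIZE, (2, 50))
frames = torch.randn(2, 30, VISUAL_DIM)
logits = V2ATransformer()(tokens, frames)
print(logits.shape)  # torch.Size([2, 50, 1024])
```

At inference, tokens would be sampled one step at a time from these logits and then decoded back to a waveform by the codec decoder; the paper's proposed visual attention mechanisms would replace the plain cross-attention used in this sketch.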