Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
July 15, 2024
Authors: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
cs.AI
Abstract
Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the onsets of the generated sounds should match the visual actions they are aligned with; otherwise, unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, either focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to improve synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results while remaining competitive with the state of the art of non-codec generative audio models. Sample videos and generated audio are available at https://maskvat.github.io.
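To illustrate the sequence-to-sequence parallel decoding the abstract refers to, below is a minimal sketch of MaskGIT-style iterative masked decoding over discrete audio codec tokens conditioned on video features. This is an illustration under assumptions, not the paper's implementation: `model`, `video_feats`, `mask_id`, and the cosine unmasking schedule are hypothetical stand-ins.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def masked_parallel_decode(model, video_feats, seq_len, mask_id,
                           num_steps=12, device="cpu"):
    """MaskGIT-style parallel decoding of audio codec tokens: start from a
    fully masked sequence and, at each step, commit the most confident
    predictions while re-masking the rest on a cosine schedule."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        # Predict logits over the codec vocabulary for every position,
        # conditioned on the video features (sequence-to-sequence).
        logits = model(tokens, video_feats)            # (1, seq_len, vocab)
        probs = F.softmax(logits, dim=-1)
        confidence, candidates = probs.max(dim=-1)     # per-position best guess
        # Positions committed in earlier steps stay fixed: exclude them.
        still_masked = tokens.eq(mask_id)
        confidence = confidence.masked_fill(~still_masked, float("-inf"))
        # Cosine schedule: fraction of positions left masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_to_commit = still_masked.sum().item() - int(frac_masked * seq_len)
        num_to_commit = max(num_to_commit, 1)
        keep = confidence.topk(num_to_commit, dim=-1).indices
        tokens.scatter_(1, keep, candidates.gather(1, keep))
    return tokens  # discrete codec tokens; the codec decoder renders audio
```

Because multiple tokens are committed in parallel at each step, the model can shape sound onsets across the whole sequence at once rather than strictly left-to-right, which is what makes this decoding style a natural fit for synchronization-sensitive V2A.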