Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
July 15, 2024
Authors: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
cs.AI
Abstract
Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the onsets of the generated sounds should match the visual actions they are aligned with; otherwise, unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, either focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to improve synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results while remaining competitive with the state of the art of non-codec generative audio models. Sample videos and generated audio are available at https://maskvat.github.io.
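To illustrate the sequence-to-sequence parallel decoding the abstract refers to, below is a minimal sketch of MaskGIT-style iterative masked decoding over discrete audio codec tokens conditioned on video features. This is an illustration under assumptions, not the paper's implementation: `model`, `video_feats`, `mask_id`, and the cosine unmasking schedule are hypothetical stand-ins.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def masked_parallel_decode(model, video_feats, seq_len, mask_id,
                           num_steps=12, device="cpu"):
    """MaskGIT-style parallel decoding of audio codec tokens: start from a
    fully masked sequence and, at each step, commit the most confident
    predictions while re-masking the rest on a cosine schedule."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        # Predict logits over the codec vocabulary for every position,
        # conditioned on the video features (sequence-to-sequence).
        logits = model(tokens, video_feats)            # (1, seq_len, vocab)
        probs = F.softmax(logits, dim=-1)
        confidence, candidates = probs.max(dim=-1)     # per-position best guess
        # Positions committed in earlier steps stay fixed: exclude them.
        still_masked = tokens.eq(mask_id)
        confidence = confidence.masked_fill(~still_masked, float("-inf"))
        # Cosine schedule: fraction of positions left masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_to_commit = still_masked.sum().item() - int(frac_masked * seq_len)
        num_to_commit = max(num_to_commit, 1)
        keep = confidence.topk(num_to_commit, dim=-1).indices
        tokens.scatter_(1, keep, candidates.gather(1, keep))
    return tokens  # discrete codec tokens; the codec decoder renders audio
```

Because multiple tokens are committed in parallel at each step, the model can shape sound onsets across the whole sequence at once rather than strictly left-to-right, which is what makes this decoding style a natural fit for synchronization-sensitive V2A.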