
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

July 15, 2024
Authors: Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
cs.AI

Abstract

Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them; otherwise, unnatural synchronization artifacts arise. Recent works have explored the progression from conditioning sound generators on still images to conditioning them on video features, either focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results on the one hand, while being competitive with the state of the art of non-codec generative audio models on the other. Sample videos and generated audio are available at https://maskvat.github.io .
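The abstract describes generating audio as discrete codec tokens with a sequence-to-sequence masked generative model. The core idea of masked generative decoding (popularized by MaskGIT-style models) can be sketched as an iterative loop: start with all token positions masked, predict all of them in parallel, and commit only the most confident predictions at each step. The sketch below is purely illustrative and is not the paper's implementation; `toy_predict` stands in for the transformer conditioned on video features, and all names and the linear unmasking schedule are assumptions.

```python
# Illustrative sketch of masked generative (parallel) decoding over discrete
# audio codec tokens. NOT the MaskVAT implementation: `toy_predict` is a
# random stand-in for a video-conditioned transformer, and the linear
# unmasking schedule is a simplification of the schedules used in practice.
import random

MASK = -1  # sentinel value for a masked codec-token position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the conditional transformer: for every masked position,
    return a (token, confidence) guess."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, t in enumerate(tokens) if t == MASK}


def masked_decode(seq_len, vocab_size, steps=4, seed=0):
    """Iteratively unmask a token sequence, committing the most confident
    predictions at each step (a simple linear schedule for brevity)."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(steps):
        preds = toy_predict(tokens, vocab_size, rng)
        if not preds:
            break
        # Number of positions to commit this step.
        k = max(1, len(preds) // (steps - step))
        # Keep the k most confident predictions; re-mask the rest.
        best = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:k]
        for pos, (tok, _conf) in best:
            tokens[pos] = tok
    # Commit anything still masked in a final pass.
    for pos, (tok, _conf) in toy_predict(tokens, vocab_size, rng).items():
        tokens[pos] = tok
    return tokens


out = masked_decode(seq_len=16, vocab_size=1024)
```

Because every step predicts all masked positions in parallel, the number of forward passes is fixed by the schedule rather than by the sequence length, which is what makes this family of decoders much faster than autoregressive token-by-token generation. In a real V2A system the decoded tokens would then be passed to the codec decoder to synthesize the waveform.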

