Temporally Aligned Audio for Video with Autoregression

September 20, 2024
Authors: Ilpo Viertola, Vladimir Iashin, Esa Rahtu
cs.AI

Abstract

We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound, and models are available at https://v-aura.notion.site
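To make the fusion idea concrete, below is a minimal sketch of how audio token embeddings might attend to high-framerate visual features before autoregressive prediction. This is an illustrative assumption in PyTorch, not the paper's actual architecture: the class name CrossModalFusion, the dimensions, and the attention-based fusion strategy are all hypothetical.

```python
# Hypothetical sketch of cross-modal audio-visual fusion for autoregressive
# audio-token prediction. Module names, dimensions, and the fusion strategy
# are illustrative assumptions, not V-AURA's published implementation.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses high-framerate visual features into audio token embeddings."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Audio embeddings (queries) attend to per-frame visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_emb: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_emb:    (batch, T_audio, d_model) - embeddings of past audio tokens
        # visual_feats: (batch, T_video, d_model) - high-framerate visual features
        fused, _ = self.cross_attn(query=audio_emb, key=visual_feats, value=visual_feats)
        return self.norm(audio_emb + fused)  # residual connection

# Usage: condition the next-audio-token prediction on visual context.
fusion = CrossModalFusion()
audio_emb = torch.randn(2, 100, 512)     # past audio token embeddings
visual_feats = torch.randn(2, 250, 512)  # e.g. ~25 fps visual features over 10 s
conditioned = fusion(audio_emb, visual_feats)  # (2, 100, 512)
```

The design intuition is that a visual feature stream dense enough in time (high framerate) lets each audio-token position attend to the specific frames around the motion event it should sound like, which is what enables fine-grained temporal alignment.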
