^RFLAV: Rolling Flow matching for infinite Audio Video generation
March 11, 2025
Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
cs.AI
Abstract
Joint audio-video (AV) generation is still a significant challenge in
generative AI, primarily due to three critical requirements: quality of the
generated samples, seamless multimodal synchronization and temporal coherence
(audio tracks that match the visual data and vice versa), and limitless
video duration. In this paper, we present ^RFLAV, a novel transformer-based
architecture that addresses all the key challenges of AV generation. We explore
three distinct cross-modality interaction modules, with our lightweight
temporal fusion module emerging as the most effective and computationally
efficient approach for aligning audio and visual modalities. Our experimental
results demonstrate that ^RFLAV outperforms existing state-of-the-art models
in multimodal AV generation tasks. Our code and checkpoints are available at
https://github.com/ErgastiAlex/R-FLAV.