R-FLAV: Rolling Flow matching for infinite Audio Video generation

Abstract

Joint audio-video (AV) generation remains a significant challenge in generative AI, primarily because of three critical requirements: high quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa; and unlimited video duration. In this paper, we present R-FLAV, a novel transformer-based architecture that addresses all of these key challenges of AV generation. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning the audio and visual modalities. Our experimental results demonstrate that R-FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
