Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Abstract

Scaling multimodal alignment between video and audio is challenging, particularly because data is limited and text descriptions do not match frame-level video information. In this work, we address the scaling challenge in multimodal-to-audio generation by examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present MMHNet, a multimodal hierarchical network that extends state-of-the-art video-to-audio models. Our approach integrates a hierarchical method with non-causal Mamba to support long-form audio generation, and it significantly improves the generation of audio longer than 5 minutes. We also demonstrate that training on short clips and testing on long ones is possible in video-to-audio generation, without training on the longer durations. Our experiments show that the proposed method achieves strong results on long-video-to-audio benchmarks and outperforms prior work on video-to-audio tasks. Moreover, our model can generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.
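To make the idea of "train short, test long" with a non-causal sequence layer more concrete, the following is a minimal toy sketch. It is not MMHNet and not Mamba: it assumes a scalar linear recurrence as a stand-in for a state-space scan, and the names `linear_scan` and `bidirectional_scan` and the `decay` and `gain` parameters are hypothetical. It only illustrates two properties mentioned in the abstract: the layer's parameters do not depend on sequence length, and a non-causal (forward plus backward) scan lets every output position see both past and future inputs.

```python
from typing import List


def linear_scan(xs: List[float], decay: float, gain: float) -> List[float]:
    """Causal recurrence h_t = decay * h_{t-1} + gain * x_t (toy stand-in)."""
    state = 0.0
    out = []
    for x in xs:
        state = decay * state + gain * x
        out.append(state)
    return out


def bidirectional_scan(xs: List[float], decay: float = 0.9, gain: float = 0.1) -> List[float]:
    """Non-causal variant: sum of a forward scan and a backward scan."""
    forward = linear_scan(xs, decay, gain)
    # Scan the reversed sequence, then flip back so indices line up.
    backward = linear_scan(xs[::-1], decay, gain)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# The same two parameters are applied to a short and a much longer sequence.
short_out = bidirectional_scan([1.0] * 8)
long_out = bidirectional_scan([1.0] * 3000)

# Non-causality: an input at the last position influences the first output.
probe = bidirectional_scan([0.0, 0.0, 0.0, 1.0])
```

Because `decay` is below 1, the state stays bounded however long the input is, which is one simple reason a recurrence of this kind can be applied to sequences longer than those it was tuned on. Whether and how this carries over to the actual model is a question the paper's experiments address, not this sketch.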
