Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Abstract

Video-to-audio (V2A) generation uses visual-only video features to render plausible sounds that match the scene. Importantly, generated sound onsets should align with the corresponding visual actions; otherwise, unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, but they either focus on quality and semantic matching while ignoring synchronization, or sacrifice some quality to improve synchronization alone. In this work, we propose MaskVAT, a V2A generative model that interconnects a full-band, high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we obtain highly synchronized results while remaining competitive with the state of the art in non-codec generative audio models. Sample videos and generated audio are available at https://maskvat.github.io.

