Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Abstract

Video and audio content creation is a core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross visual-audio and joint visual-audio generation. We observe that off-the-shelf video and audio generation models already have powerful generation ability. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models through a shared latent representation space. Specifically, we propose a multimodal latent aligner built on the pre-trained ImageBind model. Our latent aligner shares a similar core with classifier guidance, which guides the diffusion denoising process at inference time. Through a carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/
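To make the guidance idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a fixed linear map (`WEIGHTS`) as a stand-in for a pre-trained encoder such as ImageBind, a fixed `TARGET` embedding as a stand-in for the other modality's embedding, and a placeholder `toy_denoise` step in place of a real diffusion model. All of these names and values are hypothetical. After each denoising step, the latent is nudged along the gradient of the cosine similarity between its embedding and the target, which is the classifier-guidance-like mechanism the abstract describes.

```python
import math

def encode(weights, latent):
    # Hypothetical stand-in for a frozen encoder: a plain linear map.
    return [sum(w * x for w, x in zip(row, latent)) for row in weights]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_grad(a, b):
    # Analytic gradient of cosine(a, b) with respect to a.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    c = cosine(a, b)
    return [y / (norm_a * norm_b) - c * x / (norm_a ** 2) for x, y in zip(a, b)]

def toy_denoise(latent):
    # Placeholder for one denoising step of a pre-trained diffusion model
    # (assumption: a real model would predict and remove noise here).
    return [0.99 * x for x in latent]

def guided_sampling(latent, weights, target, steps=50, lr=0.1):
    # Interleave denoising with a gradient step that pulls the latent's
    # embedding toward the target embedding in the shared space.
    for _ in range(steps):
        latent = toy_denoise(latent)
        emb = encode(weights, latent)
        g_emb = cosine_grad(emb, target)
        # Chain rule through the linear encoder: grad_latent = W^T grad_emb.
        g_lat = [
            sum(weights[i][j] * g_emb[i] for i in range(len(weights)))
            for j in range(len(latent))
        ]
        latent = [x + lr * g for x, g in zip(latent, g_lat)]
    return latent

# Hypothetical toy values chosen only for illustration.
WEIGHTS = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.0, 1.0, 0.0],
]
TARGET = [0.0, 1.0, 0.0]
Z0 = [1.0, -1.0, 0.5, 0.2]

if __name__ == "__main__":
    z = guided_sampling(Z0, WEIGHTS, TARGET)
    print("similarity before:", cosine(encode(WEIGHTS, Z0), TARGET))
    print("similarity after: ", cosine(encode(WEIGHTS, z), TARGET))
```

In this sketch the generator and the encoder stay frozen and only the latent is updated at inference time, which reflects the abstract's point that existing models are bridged rather than retrained. A real system would presumably obtain the gradient by automatic differentiation through the encoder rather than by the hand-derived formula used here.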
