
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

February 27, 2024
Authors: Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen
cs.AI

Abstract

Video and audio content creation is a core technique for the movie industry and professional users. Existing diffusion-based methods tackle video and audio generation separately, which hinders the transfer of these techniques from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint visual-audio generation. We observe that off-the-shelf video and audio generation models already exhibit strong generation ability. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models through a shared latent representation space. Specifically, we propose a multimodality latent aligner built on the pre-trained ImageBind model. Our latent aligner shares the same core idea as classifier guidance, which steers the diffusion denoising process at inference time. Through a carefully designed optimization strategy and loss functions, we demonstrate the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/
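
As a rough illustration of the mechanics the abstract describes, the sketch below applies a classifier-guidance-style gradient update to the noisy latent at each denoising step: decode the current clean-latent estimate, embed it in a shared space alongside the conditioning modality, and nudge the latent down the gradient of an alignment loss. This is a minimal sketch, not the paper's implementation: `ToyEncoder`, `guided_denoising_step`, the placeholder denoiser/decoder, and the guidance `scale` are illustrative assumptions standing in for the frozen ImageBind encoders and the actual diffusion backbones.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the frozen ImageBind vision/audio towers, so the
# sketch is self-contained and runnable. The paper uses the real
# pre-trained ImageBind encoders instead.
class ToyEncoder(torch.nn.Module):
    def __init__(self, in_dim: int, embed_dim: int = 1024):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)  # unit-norm shared embedding

video_encoder = ToyEncoder(in_dim=512)  # stand-in for the vision tower
audio_encoder = ToyEncoder(in_dim=256)  # stand-in for the audio tower

def alignment_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Minimizing (1 - cosine similarity) pulls the two modalities
    # together in the shared embedding space.
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

def guided_denoising_step(x_t, t, denoiser, decode, encode, target_embed,
                          scale=50.0):
    """One classifier-guidance-style update: backprop the alignment loss
    through the clean-latent estimate and nudge the noisy latent along
    the resulting gradient."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)        # model's estimate of the clean latent
    sample = decode(x0_hat)          # rough decode for the aligner to see
    loss = alignment_loss(encode(sample), target_embed)
    grad, = torch.autograd.grad(loss, x_t)
    return (x_t - scale * grad).detach()

# Usage: steer a (placeholder) video latent toward a given audio clip.
denoiser = lambda x, t: x            # placeholder frozen diffusion model
decode = lambda z: z                 # placeholder VAE decode
audio_embed = audio_encoder(torch.randn(1, 256)).detach()
x_t = torch.randn(1, 512)
for t in reversed(range(10)):        # guidance applied at every step
    x_t = guided_denoising_step(x_t, t, denoiser, decode,
                                video_encoder, audio_embed)
```

Because the update only touches the latents at inference time, both generators stay frozen, which is what lets the approach reuse off-the-shelf video and audio diffusion models without any retraining.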