Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

February 27, 2024
Authors: Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen
cs.AI

Abstract

Video and audio content creation serves as a core technique for the movie industry and professional users. Recently, existing diffusion-based methods have tackled video and audio generation separately, which hinders the transfer of these techniques from academia to industry. In this work, we aim to fill this gap with a carefully designed, optimization-based framework for cross-visual-audio and joint visual-audio generation. We observe the powerful generation ability of off-the-shelf video and audio generation models. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models through a shared latent representation space. Specifically, we propose a multimodal latent aligner built on the pre-trained ImageBind model. Our latent aligner shares a similar core with classifier guidance, which steers the diffusion denoising process at inference time. Through a carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/
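To make the classifier-guidance-style aligner concrete, here is a minimal PyTorch sketch of one guided denoising step. It is an illustration of the general idea only, not the paper's actual implementation: `video_unet`, `audio_unet`, `decode_video`, `decode_audio`, and the `imagebind.embed_*` handles are hypothetical placeholders, and the decoders are assumed to be differentiable so gradients can reach the latents.

```python
import torch
import torch.nn.functional as F

def aligned_denoising_step(video_latent, audio_latent, t,
                           video_unet, audio_unet,
                           decode_video, decode_audio,
                           imagebind, guidance_scale=0.1):
    """One denoising step in which an ImageBind alignment loss nudges both
    modality latents toward a shared embedding, in the spirit of classifier
    guidance. All model handles are hypothetical placeholders."""
    video_latent = video_latent.detach().requires_grad_(True)
    audio_latent = audio_latent.detach().requires_grad_(True)

    # Embed rough decodings of the current latents with a frozen ImageBind
    # encoder (assumed differentiable end to end for this sketch).
    v_emb = imagebind.embed_vision(decode_video(video_latent))
    a_emb = imagebind.embed_audio(decode_audio(audio_latent))

    # Alignment loss: pull the two modality embeddings together.
    loss = 1.0 - F.cosine_similarity(v_emb, a_emb, dim=-1).mean()
    loss.backward()

    # Classifier-guidance-style update: shift each latent against the loss
    # gradient, then run the ordinary denoising step on the shifted latents.
    with torch.no_grad():
        video_latent = video_latent - guidance_scale * video_latent.grad
        audio_latent = audio_latent - guidance_scale * audio_latent.grad
        video_latent = video_unet.denoise(video_latent, t)
        audio_latent = audio_unet.denoise(audio_latent, t)
    return video_latent, audio_latent
```

The design choice this sketch mirrors is the one the abstract describes: the generative backbones stay frozen, and only inference-time gradients from a frozen multimodal encoder steer the sampling trajectory, so no large model is trained from scratch.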