Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Abstract

Video and audio content creation is a core technique for the movie industry and professional users. Existing diffusion-based methods tackle video generation and audio generation separately, which hinders the transfer of these techniques from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross visual-audio and joint visual-audio generation. We observe that off-the-shelf video and audio generation models already have powerful generation ability. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models through a shared latent representation space. Specifically, we propose a multimodality latent aligner built on the pre-trained ImageBind model. Our latent aligner shares a similar core with classifier guidance: it guides the diffusion denoising process at inference time. With a carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/
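To make the classifier-guidance analogy concrete, the following is a minimal sketch of inference-time latent alignment, not the authors' implementation. It assumes a random linear map `W` as a stand-in for the pre-trained ImageBind encoder, a random vector `cond` as the embedding of the conditioning modality, and a cosine-distance loss; the function names, step size, and step count are all hypothetical. In an actual sampler, an update of this kind would presumably be interleaved with the denoising steps of the pre-trained video or audio diffusion model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def alignment_gradient(z, W, cond):
    """Gradient of the loss 1 - cos(W z, cond) with respect to the latent z."""
    e = W @ z  # embedding of the latent in the shared space (stand-in encoder)
    e_norm = np.linalg.norm(e)
    c_norm = np.linalg.norm(cond)
    cos = (e @ cond) / (e_norm * c_norm)
    # Derivative of the cosine similarity with respect to the embedding e.
    dcos_de = cond / (e_norm * c_norm) - cos * e / (e_norm ** 2)
    # Chain rule through the linear encoder; negate because the loss is 1 - cos.
    return -(W.T @ dcos_de)


def guided_update(z, W, cond, step_size=0.1, n_steps=50):
    """Nudge the latent toward the condition embedding by gradient descent."""
    for _ in range(n_steps):
        z = z - step_size * alignment_gradient(z, W, cond)
    return z


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # hypothetical stand-in for a frozen encoder
cond = rng.normal(size=8)      # hypothetical condition embedding
z0 = rng.normal(size=16)       # hypothetical noisy latent

z1 = guided_update(z0, W, cond)
before = cosine_similarity(W @ z0, cond)
after = cosine_similarity(W @ z1, cond)
print(f"cosine similarity before: {before:.3f}, after: {after:.3f}")
```

Only the latent is updated here while the encoder stays fixed, which mirrors the abstract's point that existing pre-trained models are bridged rather than retrained.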
