Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

Abstract

Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor controllability and alignment, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope feature closely related to audio semantics, ensures high controllability and synchronization. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Code, model weights, and demonstrations are available on the accompanying website. (https://jnwnlee.github.io/video-foley-demo)
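To make the notion of a frame-level RMS envelope and its discretization concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `frame_rms` and `discretize`, the frame and hop lengths, the number of bins, and the use of uniform bins are all assumptions chosen for illustration, and the paper's actual settings may differ.

```python
import math

def frame_rms(samples, frame_length=1024, hop_length=256):
    """Return one RMS value per frame of a mono waveform (list of floats).

    frame_length and hop_length are illustrative defaults, not values
    taken from the paper.
    """
    rms = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return rms

def discretize(rms, num_bins=64, max_value=1.0):
    """Map each RMS value to an integer bin index in [0, num_bins - 1].

    Uniform binning is an assumption made for simplicity; other
    quantization schemes are possible.
    """
    indices = []
    for value in rms:
        clipped = min(max(value, 0.0), max_value)
        indices.append(min(int(clipped / max_value * num_bins), num_bins - 1))
    return indices

# Toy signal: half a second of silence, then half a second of a 440 Hz tone.
sample_rate = 16000
silence = [0.0] * (sample_rate // 2)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 2)]

envelope = frame_rms(silence + tone)   # continuous intensity curve
bins = discretize(envelope)            # class labels, one per frame
```

In this toy case the envelope stays at zero during the silence and rises once the tone begins, so the onset time and loudness of the event can be read directly from the curve. Turning the curve into bin indices allows a predictor to treat each frame as a classification target rather than a regression target.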