

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

August 21, 2024
Authors: Junwon Lee, Jaekwon Im, Dabin Kim, Juhan Nam
cs.AI

Abstract

Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor controllability and alignment, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as a temporal event condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope feature closely related to audio semantics, ensures high controllability and synchronization. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Code, model weights, and demonstrations are available on the accompanying website. (https://jnwnlee.github.io/video-foley-demo)
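To make the temporal event condition concrete, below is a minimal sketch of how a frame-level RMS intensity envelope can be computed from a waveform and then discretized. The frame length, hop length, bin count, and uniform quantization scheme are illustrative assumptions; the paper's exact configuration and discretization method may differ.

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_length: int = 1024,
                 hop_length: int = 512) -> np.ndarray:
    """Frame-level RMS intensity envelope of a mono waveform.

    Assumes a float waveform normalized to [-1, 1]. For each frame n,
    computes sqrt(mean(x[n*H : n*H + W]**2)) with window W and hop H.
    """
    n_frames = 1 + max(0, len(audio) - frame_length) // hop_length
    env = np.empty(n_frames)
    for n in range(n_frames):
        frame = audio[n * hop_length : n * hop_length + frame_length]
        env[n] = np.sqrt(np.mean(frame ** 2))
    return env

def discretize_rms(env: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Quantize the envelope into discrete bins (uniform quantization
    here as an illustrative assumption, not the paper's scheme)."""
    env = np.clip(env, 0.0, 1.0)
    return np.minimum((env * n_bins).astype(int), n_bins - 1)
```

A discretized envelope like this can serve as a compact, frame-aligned event token sequence: Video2RMS would predict such tokens from video frames, and RMS2Sound would condition audio generation on them.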

