Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Abstract

Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges include maintaining content consistency between the input video and the generated audio, and aligning the temporal and loudness properties of the audio with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which employs a masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure that the synthesized sound aligns with the video in both the loudness and temporal dimensions. Furthermore, we extend a large-scale V2A dataset with annotated caption prompts, yielding VGGSound-Caption. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.
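
To give a concrete sense of what an auxiliary loudness signal might look like, the following is a minimal sketch that computes a frame-wise RMS envelope from a mono waveform. The abstract does not specify how the loudness signal is derived, so the use of RMS, the function name `loudness_envelope`, and the parameters `frame_size` and `hop_size` are all illustrative assumptions rather than the method used in Draw an Audio.

```python
import math


def loudness_envelope(samples, frame_size, hop_size):
    """Return one RMS value per frame of a mono signal.

    Hypothetical helper: RMS is assumed here as a simple loudness proxy;
    the paper's actual loudness signal may be computed differently.
    """
    if frame_size <= 0 or hop_size <= 0:
        raise ValueError("frame_size and hop_size must be positive")

    envelope = []
    # Slide a window over the signal; the last start index is chosen so
    # that every frame (except for very short inputs) is full length.
    last_start = max(len(samples) - frame_size, 0)
    for start in range(0, last_start + 1, hop_size):
        frame = samples[start:start + frame_size]
        if not frame:
            break  # empty input: nothing to measure
        mean_square = sum(x * x for x in frame) / len(frame)
        envelope.append(math.sqrt(mean_square))
    return envelope


# Usage: four silent samples followed by four samples at amplitude 0.5.
signal = [0.0] * 4 + [0.5] * 4
print(loudness_envelope(signal, frame_size=4, hop_size=4))
```

A curve of this kind carries both timing (when the sound starts and stops) and intensity (how loud each frame is), which is the sort of information a time-loudness control would need to convey.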