Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Abstract

Foley is a term commonly used in filmmaking for the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A) synthesis, a particular type of automatic Foley task, presents inherent challenges related to audio-visual synchronization. These challenges include maintaining content consistency between the input video and the generated audio, and aligning the temporal and loudness properties of the audio with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which uses a masked video instruction to make the model focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure that the synthesized sound aligns with the video in both the loudness and temporal dimensions. Furthermore, we extend a large-scale V2A dataset by annotating caption prompts; the extended dataset is named VGGSound-Caption. Extensive experiments on challenging benchmarks across two large-scale V2A datasets verify that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.
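
The two control signals named in the abstract, a drawn mask and a loudness signal, can be illustrated with a minimal sketch. The abstract does not specify how either signal is computed, so everything here is an assumption: the function names `apply_mask` and `loudness_envelope` are hypothetical, the mask is taken to be binary, and loudness is approximated by frame-wise RMS over non-overlapping frames. It is not the authors' implementation of MAM or TLM.

```python
import math


def apply_mask(frame, mask):
    """Keep pixels where mask is 1 and zero out the rest.

    `frame` and `mask` are equally sized 2D lists (rows of numbers).
    Assumption: a binary mask marks the drawn region of interest.
    """
    return [
        [pixel * m for pixel, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


def loudness_envelope(samples, frame_len):
    """Return one RMS value per non-overlapping frame of a mono signal.

    Assumption: RMS stands in for loudness; trailing samples that do not
    fill a whole frame are dropped.
    """
    envelope = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        envelope.append(rms)
    return envelope


# Toy inputs: a 2x2 "frame" with a diagonal mask, and a short signal
# that is silent, then loud, then silent again.
masked = apply_mask([[1, 2], [3, 4]], [[1, 0], [0, 1]])
envelope = loudness_envelope([0, 0, 1, 1, -1, -1, 0, 0], frame_len=2)
```

In this toy setup, the masked frame retains only the selected pixels, and the envelope is one value per frame that rises where the signal is active. A curve of this kind is the sort of auxiliary signal that could tell a generator when, and how loudly, sound should occur.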