Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Abstract

Foley is a term commonly used in filmmaking, referring to the addition of everyday sound effects to silent films or videos to enhance the auditory experience. Video-to-Audio (V2A), as a particular type of automatic foley task, presents inherent challenges related to audio-visual synchronization. These challenges include maintaining content consistency between the input video and the generated audio, and aligning the temporal and loudness properties of the audio with the video. To address these issues, we construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. To ensure content consistency between the synthesized audio and the target video, we introduce the Mask-Attention Module (MAM), which employs masked video instruction to enable the model to focus on regions of interest. Additionally, we implement the Time-Loudness Module (TLM), which uses an auxiliary loudness signal to ensure that the synthesized sound aligns with the video in both the loudness and temporal dimensions. Furthermore, we extend a large-scale V2A dataset by annotating caption prompts, yielding VGGSound-Caption. Extensive experiments on challenging benchmarks across two large-scale V2A datasets show that Draw an Audio achieves state-of-the-art performance. Project page: https://yannqi.github.io/Draw-an-Audio/.
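
As a rough illustration of the two instruction signals mentioned above, the sketch below computes a frame-wise loudness envelope from a waveform and applies a drawn binary mask to a single video frame. The abstract does not specify how the loudness signal is computed or how the mask is applied, so the frame-wise RMS in decibels and the element-wise masking are assumptions made here for illustration only. The function names `loudness_envelope` and `apply_mask` are hypothetical and are not taken from the authors' code.

```python
import math


def loudness_envelope(samples, frame_len, eps=1e-8):
    """Frame-wise RMS loudness in dB for a mono waveform (list of floats).

    Assumption: RMS in dB is used here as a stand-in loudness measure;
    the source does not state which measure the model uses.
    """
    envelope = []
    # Non-overlapping frames; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        envelope.append(20.0 * math.log10(rms + eps))
    return envelope


def apply_mask(frame, mask):
    """Keep pixels where mask == 1 and zero out the rest.

    Assumption: the drawn mask is binary and has the same shape as the frame.
    """
    return [
        [pixel * keep for pixel, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


# Toy waveform: one silent frame followed by one frame of a sine tone.
frame_len = 8
silence = [0.0] * frame_len
tone = [0.5 * math.sin(2 * math.pi * k / 4) for k in range(frame_len)]
envelope = loudness_envelope(silence + tone, frame_len)

# Toy 2x2 grayscale frame with a mask selecting the diagonal.
frame = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 1]]
masked = apply_mask(frame, mask)
```

In this toy case the envelope has one value per frame, with the silent frame far quieter than the tone frame, which is the kind of temporal and loudness cue a conditioning signal of this sort would carry.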