Self-Supervised Audio-Visual Soundscape Stylization

Abstract

Speech sounds convey a great deal of information about the scenes in which they are recorded, with effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/
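The self-supervised pairing described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed-length slot scheme, and the dictionary keys are all hypothetical, the speech enhancement step is replaced by a do-nothing stand-in, and the visual frame that accompanies the conditional clip is omitted. It only shows how one video's audio track could yield an (input, condition, target) triple.

```python
import random


def placeholder_enhance(clip):
    """Hypothetical stand-in for a speech enhancement model.

    A real system would remove reverberation and ambient sound here;
    this placeholder just returns a copy of the clip.
    """
    return list(clip)


def make_training_example(audio, sr, clip_sec, enhance, rng):
    """Build one (input, condition, target) triple from one video's audio.

    audio    -- sequence of samples from a single video (assumed mono)
    sr       -- sample rate in Hz
    clip_sec -- clip duration in seconds
    enhance  -- callable applied to the target clip to form the model input
    rng      -- random.Random instance, for reproducible sampling
    """
    clip_len = int(clip_sec * sr)
    n_slots = len(audio) // clip_len
    if n_slots < 2:
        raise ValueError("audio too short for two non-overlapping clips")

    # Pick two distinct slots so the conditional clip comes from
    # elsewhere in the same video and never overlaps the target.
    tgt_slot, cond_slot = rng.sample(range(n_slots), 2)
    target = list(audio[tgt_slot * clip_len:(tgt_slot + 1) * clip_len])
    condition = list(audio[cond_slot * clip_len:(cond_slot + 1) * clip_len])

    return {
        "input": enhance(target),   # enhanced speech fed to the model
        "condition": condition,     # hint carrying the scene's sound properties
        "target": target,           # original speech the model should recover
    }
```

In this sketch, a generative model (a latent diffusion model in the paper) would be trained to map `input` plus `condition` back to `target`, so that at test time a conditional clip from a different scene can be supplied instead.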
