Self-Supervised Audio-Visual Soundscape Stylization

September 22, 2024
Authors: Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli
cs.AI

Abstract

Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities. Please see our project webpage for video results: https://tinglok.netlify.app/files/avsoundscape/
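
The abstract describes the self-supervised recipe as constructing (input, condition, target) triples from a single unlabeled video. The sketch below illustrates one plausible way to build such a triple. It is a minimal illustration under our own assumptions, not the authors' released code: the `video` object, its `.duration`, `.audio(t, dur)`, and `.frames(t, dur)` accessors, and the `enhance_speech` stub are all hypothetical placeholders.

```python
import random


def enhance_speech(waveform):
    # Placeholder: in practice this would be an off-the-shelf speech
    # enhancement model that strips reverberation and background sound.
    return waveform


def make_training_triple(video, clip_len=5.0):
    """Build one self-supervised example from a single unlabeled video.

    `video` is a hypothetical wrapper exposing `.duration` (seconds),
    `.audio(t, dur)`, and `.frames(t, dur)`; it stands in for whatever
    video-loading library is actually used.
    """
    # Sample the input clip, then a conditional clip from elsewhere in the video.
    t_input = random.uniform(0.0, video.duration - clip_len)
    t_cond = random.uniform(0.0, video.duration - clip_len)
    while abs(t_cond - t_input) < clip_len:  # re-sample until the clips do not overlap
        t_cond = random.uniform(0.0, video.duration - clip_len)

    target = video.audio(t_input, clip_len)       # original speech, scene effects intact
    clean = enhance_speech(target)                # enhanced speech with scene effects removed
    cond_audio = video.audio(t_cond, clip_len)    # conditional hint: audio from elsewhere...
    cond_frames = video.frames(t_cond, clip_len)  # ...plus its accompanying video frames

    # A latent diffusion model is then trained to map (clean, condition) back to
    # `target`, i.e., to re-impose the scene's reverberation and ambient sound.
    return clean, (cond_audio, cond_frames), target
```

Because both clips come from the same video, recurring sound events and textures make the conditional clip an informative cue for the scene's acoustics, which is what lets the model learn the transfer without labels.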
