Self-Supervised Audio-Visual Soundscape Stylization
September 22, 2024
Authors: Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli
cs.AI
Abstract
Speech sounds convey a great deal of information about the scenes in which
they occur, resulting in a variety of effects ranging from reverberation to
additional ambient sounds. In this paper, we manipulate input speech to sound as though it were
recorded within a different scene, given an audio-visual conditional example
recorded from that scene. Our model learns through self-supervision, taking
advantage of the fact that natural video contains recurring sound events and
textures. We extract an audio clip from a video and apply speech enhancement.
We then train a latent diffusion model to recover the original speech, using
another audio-visual clip taken from elsewhere in the video as a conditional
hint. Through this process, the model learns to transfer the conditional
example's sound properties to the input speech. We show that our model can be
successfully trained using unlabeled, in-the-wild videos, and that an
additional visual signal can improve its sound prediction abilities. Please see
our project webpage for video results:
https://tinglok.netlify.app/files/avsoundscape/
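To make the training recipe in the abstract concrete, below is a minimal, runnable PyTorch sketch of one self-supervised step: extract two clips from the same video, enhance one to strip its scene acoustics, and train a diffusion-style denoiser to recover the original clip's latent given the clean speech and the audio-visual hint. All module names and shapes here (Stub, enhancer, audio_vae, cond_enc, denoiser, the 16 kHz clip lengths, the 512-dim frame features, and the simple noising schedule) are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 64

class Stub(nn.Module):
    """Tiny stand-in network so the sketch runs end to end."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        return self.net(x)

# Hypothetical components (names are ours, not the paper's):
enhancer  = Stub(16000, 16000)          # speech enhancement (removes scene acoustics)
audio_vae = Stub(16000, LATENT_DIM)     # encodes audio into the diffusion latent
cond_enc  = Stub(16000 + 512, LATENT_DIM)      # fuses hint audio + visual features
denoiser  = nn.Linear(3 * LATENT_DIM + 1, LATENT_DIM)  # predicts the added noise

def training_step(video_audio, frame_feats):
    """One self-supervised step. `video_audio` is (B, 48000) raw audio from a
    single video; `frame_feats` is (B, 512) visual features for the hint clip."""
    # 1) Two clips from the same video: one to restore, one as the hint.
    target = video_audio[:, :16000]        # clip whose scene sound we restore
    hint   = video_audio[:, 32000:48000]   # clip taken from elsewhere in the video

    # 2) Enhancement strips the scene's acoustics, giving a "clean" input,
    #    so (clean, target) is a free supervised pair.
    clean = enhancer(target)

    # 3) Denoising-diffusion objective on the ORIGINAL clip's latent,
    #    conditioned on the clean speech and the audio-visual hint.
    z0      = audio_vae(target)
    clean_z = audio_vae(clean)
    cond    = cond_enc(torch.cat([hint, frame_feats], dim=-1))
    t     = torch.rand(z0.shape[0], 1)               # continuous noise level
    noise = torch.randn_like(z0)
    z_t   = (1 - t).sqrt() * z0 + t.sqrt() * noise   # simple noising schedule
    pred  = denoiser(torch.cat([z_t, clean_z, cond, t], dim=-1))
    return F.mse_loss(pred, noise)

loss = training_step(torch.randn(4, 48000), torch.randn(4, 512))
loss.backward()
```

Under this framing, transfer at test time falls out naturally: the hint clip is simply drawn from the target scene instead of the source video, and the model imposes that scene's sound properties on the input speech.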