SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
June 6, 2024
Authors: Rishit Dagli, Shivesh Prakash, Robert Wu, Houman Khosravani
cs.AI
Abstract
Generating combined visual and auditory sensory experiences is critical for
the consumption of immersive content. Recent advances in neural generative
models have enabled the creation of high-resolution content across multiple
modalities such as images, text, speech, and videos. Despite these successes,
there remains a significant gap in the generation of high-quality spatial audio
that complements generated visual content. Furthermore, current audio
generation models excel at generating natural audio, speech, or music, but fall
short of integrating the spatial audio cues necessary for immersive
experiences. In this work, we introduce SEE-2-SOUND, a zero-shot approach that
decomposes the task into (1) identifying visual regions of interest; (2)
locating these elements in 3D space; (3) generating mono-audio for each; and
(4) integrating them into spatial audio. Using our framework, we demonstrate
compelling results for generating spatial audio for high-quality videos,
images, and dynamic images from the internet, as well as media generated by
learned approaches.
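To make the four-stage decomposition in the abstract concrete, here is a minimal, runnable Python sketch of the pipeline shape. It is illustrative only: the function names (detect_regions, locate_in_3d, generate_mono_audio, spatialize) and their toy numpy implementations are assumptions of this sketch, not the paper's code; SEE-2-SOUND replaces each stage with a learned or off-the-shelf model (region/saliency detection, depth-based 3D localization, conditional mono-audio generation, and spatial-audio rendering).

```python
# Toy sketch of the four-stage SEE-2-SOUND decomposition (numpy only).
# Each stage is a simple stand-in for the learned component used in the paper.
import numpy as np

SAMPLE_RATE = 16_000


def detect_regions(image, top_k=3, patch=32):
    """(1) Identify visual regions of interest.
    Toy heuristic: centers of the top_k brightest patches (a real system
    would use a segmentation or saliency model)."""
    h, w = image.shape[:2]
    scores = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            mean = image[y:y + patch, x:x + patch].mean()
            scores.append((mean, y + patch // 2, x + patch // 2))
    scores.sort(reverse=True)
    return [(y, x) for _, y, x in scores[:top_k]]


def locate_in_3d(region, image_shape, depth=2.0):
    """(2) Place a region in 3D listener space.
    Maps normalized image coordinates to (x, y, z) at a fixed assumed depth;
    the real pipeline would use monocular depth estimation."""
    y, x = region
    h, w = image_shape[:2]
    return np.array([(x / w - 0.5) * 2.0, (0.5 - y / h) * 2.0, depth])


def generate_mono_audio(region, duration=1.0):
    """(3) Generate a mono clip for one region.
    Placeholder: a sine tone whose pitch depends on the region; the paper
    instead conditions an audio generation model on the visual content."""
    t = np.linspace(0.0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    freq = 220.0 + 20.0 * (region[0] + region[1])
    return 0.3 * np.sin(2.0 * np.pi * freq * t)


def spatialize(clips, positions):
    """(4) Integrate mono clips into spatial (here: stereo) audio.
    Pans each clip by its horizontal position and attenuates by distance;
    a real renderer would produce ambisonic or binaural output."""
    out = np.zeros((max(len(c) for c in clips), 2))
    for clip, pos in zip(clips, positions):
        pan = np.clip((pos[0] + 1.0) / 2.0, 0.0, 1.0)  # 0 = left, 1 = right
        gain = 1.0 / max(np.linalg.norm(pos), 1e-3)
        out[:len(clip), 0] += gain * (1.0 - pan) * clip
        out[:len(clip), 1] += gain * pan * clip
    return out


if __name__ == "__main__":
    image = np.random.rand(128, 128)                  # stand-in input frame
    regions = detect_regions(image)
    positions = [locate_in_3d(r, image.shape) for r in regions]
    clips = [generate_mono_audio(r) for r in regions]
    stereo = spatialize(clips, positions)
    print(stereo.shape)                               # (samples, 2)
```

Because every stage only consumes the previous stage's output, each toy function above can be swapped for a pretrained model without changing the overall flow, which is what makes the approach zero-shot.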