SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
June 6, 2024
Authors: Rishit Dagli, Shivesh Prakash, Robert Wu, Houman Khosravani
cs.AI
Abstract
Generating combined visual and auditory sensory experiences is critical for
the consumption of immersive content. Recent advances in neural generative
models have enabled the creation of high-resolution content across multiple
modalities such as images, text, speech, and videos. Despite these successes,
there remains a significant gap in the generation of high-quality spatial audio
that complements generated visual content. Furthermore, current audio
generation models excel at generating natural audio, speech, or music, but
fall short in integrating the spatial audio cues necessary for immersive
experiences. In this work, we introduce SEE-2-SOUND, a zero-shot approach that
decomposes the task into (1) identifying visual regions of interest; (2)
locating these elements in 3D space; (3) generating mono-audio for each; and
(4) integrating them into spatial audio. Using our framework, we demonstrate
compelling results for generating spatial audio for high-quality videos,
images, and dynamic images from the internet, as well as media generated by
learned approaches.
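The abstract's four-step decomposition can be read as a compositional pipeline. The sketch below is a minimal, hypothetical illustration of that structure in Python, not the SEE-2-SOUND implementation: the region detector, 3D localization, mono-audio generator, and spatial renderer are all stand-ins, and names such as `find_regions_of_interest` and `spatialize` are invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class SourceRegion:
    """A visual region of interest and its estimated 3D position."""
    label: str
    bbox: Tuple[int, int, int, int]          # (x0, y0, x1, y1) in pixels
    position_3d: Tuple[float, float, float]  # (x, y, z) in the viewer's frame


def find_regions_of_interest(image: np.ndarray) -> List[SourceRegion]:
    """Steps 1-2: detect sound-emitting regions and place them in 3D.

    Stand-in for segmentation plus monocular depth estimation; here it
    simply returns one fixed example region near the image centre.
    """
    h, w = image.shape[:2]
    return [SourceRegion("example_source",
                         (w // 4, h // 4, 3 * w // 4, 3 * h // 4),
                         (0.0, 0.0, 2.0))]


def generate_mono_audio(region: SourceRegion, num_samples: int, sr: int) -> np.ndarray:
    """Step 3: produce mono audio for one region.

    Stand-in for a conditional audio generation model; here a placeholder tone.
    """
    t = np.arange(num_samples) / sr
    return 0.1 * np.sin(2 * np.pi * 440.0 * t)


def spatialize(mono: np.ndarray, position: Tuple[float, float, float]) -> np.ndarray:
    """Step 4: render a mono signal at a 3D position as a stereo (2, N) signal.

    Real spatial rendering would use HRTFs or ambisonics; this sketch only
    applies a crude left/right pan derived from the x coordinate.
    """
    x, _, z = position
    pan = 0.5 + 0.5 * np.clip(x / max(abs(z), 1e-6), -1.0, 1.0)  # 0 = left, 1 = right
    return np.stack([np.sqrt(1.0 - pan) * mono, np.sqrt(pan) * mono])


def image_to_spatial_audio(image: np.ndarray, duration_s: float = 2.0, sr: int = 16000) -> np.ndarray:
    """Compose the four stages: detect, localize, generate, and mix sources."""
    n = int(duration_s * sr)
    mix = np.zeros((2, n))
    for region in find_regions_of_interest(image):
        mix += spatialize(generate_mono_audio(region, n, sr), region.position_3d)
    return mix


if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder input image
    stereo = image_to_spatial_audio(frame)
    print(stereo.shape)  # (2, 32000)
```

Each placeholder corresponds to one of the four stages named in the abstract; in the actual method each would be a learned or signal-processing component suited to that stage, with the final stage performing proper spatial rendering rather than a simple stereo pan.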