Sounding that Object: Interactive Object-Aware Image to Audio Generation
June 4, 2025
Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
cs.AI
Abstract
Generating accurate sounds for complex audio-visual scenes is challenging,
especially in the presence of multiple objects and sound sources. In this
paper, we propose an interactive object-aware audio generation model that
grounds sound generation in user-selected visual objects within images. Our
method integrates object-centric learning into a conditional latent diffusion
model, which learns to associate image regions with their corresponding sounds
through multi-modal attention. At test time, our model employs image
segmentation to allow users to interactively generate sounds at the object
level. We theoretically validate that our attention mechanism
functionally approximates test-time segmentation masks, ensuring the generated
audio aligns with selected objects. Quantitative and qualitative evaluations
show that our model outperforms baselines, achieving better alignment between
objects and their associated sounds. Project page:
https://tinglok.netlify.app/files/avobject/
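
To make the conditioning idea concrete, below is a minimal sketch (not the authors' released code) of object-masked multi-modal cross-attention, in which the audio latent tokens of a conditional latent diffusion model attend to per-region image features, and a user-selected segmentation mask restricts that attention to one object at test time. The class and argument names (`MaskedCrossAttention`, `region_feats`, `obj_mask`) and the PyTorch framing are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, assuming a PyTorch-style setup; names are hypothetical.
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Audio latent tokens (queries) attend to image-region features (keys/values).
    At test time, a binary segmentation mask over regions limits attention to the
    user-selected object, so the generated audio is grounded in that object."""

    def __init__(self, audio_dim: int, image_dim: int, attn_dim: int):
        super().__init__()
        self.q = nn.Linear(audio_dim, attn_dim)
        self.k = nn.Linear(image_dim, attn_dim)
        self.v = nn.Linear(image_dim, attn_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, audio_tokens, region_feats, obj_mask=None):
        # audio_tokens: (B, T, audio_dim)  latent audio tokens being denoised
        # region_feats: (B, R, image_dim)  per-region visual features
        # obj_mask:     (B, R) binary mask, 1 = region belongs to the selected object
        q = self.q(audio_tokens)                      # (B, T, D)
        k = self.k(region_feats)                      # (B, R, D)
        v = self.v(region_feats)                      # (B, R, D)
        logits = torch.einsum("btd,brd->btr", q, k) * self.scale
        if obj_mask is not None:
            # Block attention to regions outside the selected object.
            logits = logits.masked_fill(obj_mask[:, None, :] == 0, float("-inf"))
        attn = logits.softmax(dim=-1)                 # (B, T, R)
        return torch.einsum("btr,brd->btd", attn, v)  # object-conditioned context
```

In this reading, training without `obj_mask` lets the attention learn which image regions go with which sounds, while supplying the mask at inference narrows the conditioning to the chosen object, consistent with the abstract's claim that the attention mechanism functionally approximates test-time segmentation masks.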