Sounding that Object: Interactive Object-Aware Image to Audio Generation
June 4, 2025
Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
cs.AI
Abstract
Generating accurate sounds for complex audio-visual scenes is challenging,
especially in the presence of multiple objects and sound sources. In this
paper, we propose an interactive object-aware audio generation model that
grounds sound generation in user-selected visual objects within images. Our
method integrates object-centric learning into a conditional latent diffusion
model, which learns to associate image regions with their corresponding sounds
through multi-modal attention. At test time, our model employs image
segmentation to allow users to interactively generate sounds at the object
level. We theoretically validate that our attention mechanism
functionally approximates test-time segmentation masks, ensuring the generated
audio aligns with selected objects. Quantitative and qualitative evaluations
show that our model outperforms baselines, achieving better alignment between
objects and their associated sounds. Project page:
https://tinglok.netlify.app/files/avobject/
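
To make the conditioning idea concrete, below is a minimal sketch (not the authors' released code) of object-masked multi-modal cross-attention, in which the audio latent tokens of a conditional latent diffusion model attend to per-region image features, and a user-selected segmentation mask restricts that attention to one object at test time. The class and argument names (`MaskedCrossAttention`, `region_feats`, `obj_mask`) and the PyTorch framing are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch, assuming a PyTorch-style setup; names are hypothetical.
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Audio latent tokens (queries) attend to image-region features (keys/values).
    At test time, a binary segmentation mask over regions limits attention to the
    user-selected object, so the generated audio is grounded in that object."""

    def __init__(self, audio_dim: int, image_dim: int, attn_dim: int):
        super().__init__()
        self.q = nn.Linear(audio_dim, attn_dim)
        self.k = nn.Linear(image_dim, attn_dim)
        self.v = nn.Linear(image_dim, attn_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, audio_tokens, region_feats, obj_mask=None):
        # audio_tokens: (B, T, audio_dim)  latent audio tokens being denoised
        # region_feats: (B, R, image_dim)  per-region visual features
        # obj_mask:     (B, R) binary mask, 1 = region belongs to the selected object
        q = self.q(audio_tokens)                      # (B, T, D)
        k = self.k(region_feats)                      # (B, R, D)
        v = self.v(region_feats)                      # (B, R, D)
        logits = torch.einsum("btd,brd->btr", q, k) * self.scale
        if obj_mask is not None:
            # Block attention to regions outside the selected object.
            logits = logits.masked_fill(obj_mask[:, None, :] == 0, float("-inf"))
        attn = logits.softmax(dim=-1)                 # (B, T, R)
        return torch.einsum("btr,brd->btd", attn, v)  # object-conditioned context
```

In this reading, training without `obj_mask` lets the attention learn which image regions go with which sounds, while supplying the mask at inference narrows the conditioning to the chosen object, consistent with the abstract's claim that the attention mechanism functionally approximates test-time segmentation masks.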