Sounding that Object: Interactive Object-Aware Image to Audio Generation
June 4, 2025
Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
cs.AI
Abstract
Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/