Sounding that Object: Interactive Object-Aware Image to Audio Generation

Abstract

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: https://tinglok.netlify.app/files/avobject/
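The idea of swapping learned attention for a user-selected segmentation mask at test time can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the scaled dot-product scoring, the patch features, and the query vector are all hypothetical, chosen only to show how attention weights and a normalized binary mask can play the same role when pooling image-region features into a conditioning vector.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, feats):
    # Pool per-patch feature vectors into one conditioning vector.
    dim = len(feats[0])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]

def attention_weights(query, patch_feats):
    # Assumed training-time behaviour: soft attention over image patches
    # using scaled dot-product scores (an illustrative choice).
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, f)) / scale for f in patch_feats]
    return softmax(scores)

def mask_weights(mask):
    # Assumed test-time behaviour: a user-selected binary mask,
    # normalized to sum to 1, stands in for the attention weights.
    total = sum(mask)
    if total == 0:
        raise ValueError("mask selects no patches")
    return [m / total for m in mask]

# Toy data (hypothetical): 4 image patches with 2-dimensional features.
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query = [2.0, 0.0]

attn_w = attention_weights(query, patch_feats)
attn_cond = weighted_sum(attn_w, patch_feats)

mask = [1, 0, 1, 0]  # user selects patches 0 and 2
mask_w = mask_weights(mask)
mask_cond = weighted_sum(mask_w, patch_feats)
```

In this toy setting the attention concentrates on patches 0 and 2, the same patches the mask selects, so the two pooled vectors point in a similar direction. The paper's theoretical claim concerns its own attention mechanism and is not demonstrated by this sketch.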
