AuralSAM2: Enabling SAM2 to Hear Through Pyramid Audio-Visual Feature Prompting

Abstract

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Both directions fall short in human-in-the-loop scenarios because of limited prompt accuracy and increased inference overhead. In particular, adapter-based methods often suffer from audio prompt dilution, where the audio signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align the modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in the dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.