AuralSAM2: Enabling SAM2 to Hear Through Pyramid Audio-Visual Feature Prompting

Abstract

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
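The abstract does not spell out the form of the audio-guided contrastive loss. As a rough illustration of the general idea only, the sketch below uses a generic InfoNCE-style objective in which an audio embedding acts as the anchor and visual features marked as sounding are the positives. The function names, the `positive_mask` argument, and the temperature value are hypothetical choices made for this example and are not taken from the AuralSAM2 code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def audio_guided_info_nce(audio, visual, positive_mask, temperature=0.1):
    """Generic InfoNCE-style loss with the audio embedding as anchor.

    audio:         one audio embedding (list of floats)
    visual:        list of visual feature vectors, same dimension as audio
    positive_mask: list of bools, True where a visual feature is assumed
                   to belong to the sounding object (hypothetical labels)
    """
    logits = [cosine(audio, v) / temperature for v in visual]
    # Subtract the maximum logit for numerical stability before exponentiating.
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    positive = sum(w for w, is_pos in zip(weights, positive_mask) if is_pos)
    if positive == 0.0:
        raise ValueError("positive_mask must select at least one feature")
    # Negative log of the share of similarity mass that falls on positives.
    return -math.log(positive / sum(weights))


# Toy example: three visual features, the first aligned with the audio.
audio = [1.0, 0.0]
visual = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = audio_guided_info_nce(audio, visual, [True, False, False])
loss_misaligned = audio_guided_info_nce(audio, visual, [False, False, True])
```

In this toy setting the loss is lower when the feature marked as positive points in the same direction as the audio embedding, which is the behaviour a contrastive alignment term is meant to encourage. How the positives are selected in practice, and at which pyramid levels the loss is applied, is not stated in the abstract.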