AuralSAM2: Enabling SAM2 to Hear Through Pyramid Audio-Visual Feature Prompting

Abstract

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
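The abstract does not give the form of the audio-guided contrastive loss, so the following is only a minimal sketch under stated assumptions, not the AuralSAM2 implementation. It assumes a generic InfoNCE-style objective in which a single audio embedding acts as the anchor, visual features marked as belonging to the sounding object are positives, and all other visual features are negatives. The function names, the `is_sounding` mask, and the `temperature` value are hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def audio_guided_info_nce(audio, visual_feats, is_sounding, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's loss).

    audio        : audio embedding used as the anchor
    visual_feats : list of visual feature vectors (e.g. one per location)
    is_sounding  : list of booleans, True where the feature is a positive
    """
    # Temperature-scaled similarity of the audio anchor to every visual feature.
    logits = [cosine(audio, v) / temperature for v in visual_feats]

    # Numerically stable log-sum-exp over all visual features.
    max_logit = max(logits)
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))

    positives = [l for l, pos in zip(logits, is_sounding) if pos]
    if not positives:
        return 0.0  # no sounding region: nothing to pull toward the audio

    # Average negative log-probability assigned to the positive features.
    return -sum(l - log_denom for l in positives) / len(positives)


if __name__ == "__main__":
    audio = [1.0, 0.0]
    visual = [[1.0, 0.0], [0.0, 1.0]]
    # The loss is small when the audio-aligned feature is the positive one
    # and large when the orthogonal feature is labelled as positive.
    print(audio_guided_info_nce(audio, visual, [True, False]))
    print(audio_guided_info_nce(audio, visual, [False, True]))
```

Minimising a loss of this kind pulls the audio embedding toward visual features of the sounding object and pushes it away from the rest, which is one plausible reading of "emphasising auditory relevance in dominant visual features".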