AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation
August 3, 2024
Authors: Zili Wang, Qi Yang, Linsu Shi, Jiazhong Yu, Qinghua Liang, Fei Li, Shiming Xiang
cs.AI
Abstract
Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their high computational cost makes real-time inference impractical. By characterizing the attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, in which Softmax over-concentrates attention weights within restricted frames, and 2) an inefficient, burdensome transformer decoder caused by narrow focus patterns in early stages. In this paper, we introduce AVESFormer, the first real-time Audio-Visual Efficient Segmentation transformer, which is simultaneously fast, efficient, and lightweight. Our model leverages an efficient prompt query generator to correct the behaviour of cross-attention. Additionally, we propose the ELF decoder, which improves efficiency by employing convolutions suited to local features, reducing the computational burden. Extensive experiments demonstrate that our AVESFormer significantly enhances model performance, achieving 79.9% on S4, 57.9% on MS3 and 31.2% on AVSS, outperforming the previous state of the art and striking an excellent trade-off between performance and speed. Code can be found at https://github.com/MarkXCloud/AVESFormer.git.
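
The attention-dissipation effect the abstract describes can be sketched numerically. The toy below is our illustration, not the authors' code: it applies Softmax to cross-attention logits over a small number of audio frames and reports how the attention mass concentrates. The frame counts, logit scale, and random seed are all assumptions chosen for the demo.

```python
# Toy illustration (not the authors' code) of "attention dissipation":
# Softmax over a restricted number of frames concentrates attention mass.
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Cross-attention logits from one visual query to T audio frames.
for T in (5, 50):  # a short clip offers only a handful of frames
    logits = rng.normal(scale=2.0, size=T)  # toy similarity scores
    w = softmax(logits)
    entropy = -(w * np.log(w)).sum()
    print(f"T={T:3d}  max weight={w.max():.3f}  "
          f"entropy={entropy:.3f} (uniform would be {np.log(T):.3f})")
```

With only a few frames available, a single frame can absorb nearly all of the Softmax mass, matching the degenerate cross-attention behaviour that, per the abstract, the prompt query generator is meant to correct.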
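Likewise, the abstract's claim that convolutions on local features lighten the decoder follows from a standard operation-count argument. The sketch below is a back-of-the-envelope comparison under assumed shapes (channel width 256, square feature maps), not the paper's analysis: self-attention cost grows quadratically with the number of tokens, while a 3x3 convolution grows linearly.

```python
# Back-of-the-envelope multiply-add counts; all shapes are illustrative assumptions.

def attn_flops(n: int, d: int) -> int:
    # Q @ K^T and attn @ V each cost about n^2 * d multiply-adds.
    return 2 * n * n * d

def conv_flops(n: int, d: int, k: int = 3) -> int:
    # A k x k conv with d input and d output channels costs k^2 * d^2 per position.
    return n * k * k * d * d

d = 256  # assumed channel width
for hw in (28, 56, 112):
    n = hw * hw  # tokens on an hw x hw feature map
    print(f"{hw:3d}x{hw:<3d}  attention {attn_flops(n, d) / 1e9:6.1f} GFLOPs   "
          f"3x3 conv {conv_flops(n, d) / 1e9:5.1f} GFLOPs")
```

On high-resolution decoder maps the quadratic attention term dominates, which is consistent with the abstract's motivation for favouring convolutions on local features.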