

AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

August 3, 2024
Authors: Zili Wang, Qi Yang, Linsu Shi, Jiazhong Yu, Qinghua Liang, Fei Li, Shiming Xiang
cs.AI

Abstract

Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their expensive computational cost makes real-time inference impractical. By characterizing the attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, corresponding to over-concentrated attention weights produced by Softmax within restricted frames, and 2) an inefficient, burdensome transformer decoder, caused by narrow focus patterns in early stages. In this paper, we introduce AVESFormer, the first real-time Audio-Visual Efficient Segmentation transformer, which is simultaneously fast, efficient, and lightweight. Our model leverages an efficient prompt query generator to correct the behaviour of cross-attention. Additionally, we propose the ELF decoder, which brings greater efficiency by using convolutions suited to local features to reduce computational burdens. Extensive experiments demonstrate that our AVESFormer significantly enhances model performance, achieving 79.9% on S4, 57.9% on MS3 and 31.2% on AVSS, outperforming the previous state-of-the-art and achieving an excellent trade-off between performance and speed. Code can be found at https://github.com/MarkXCloud/AVESFormer.git.
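To make the two ideas in the abstract concrete, below is a minimal, hypothetical PyTorch-style sketch of (a) a convolutional block of the kind an ELF-style decoder could use for local features in early stages, and (b) an audio-conditioned prompt query generator. The class names, dimensions, and composition are illustrative assumptions, not the paper's actual implementation (see the official repository linked above for that).

```python
# Minimal sketch (assumed PyTorch-style); module names are illustrative,
# not taken from the official AVESFormer code.
import torch
import torch.nn as nn


class ConvLocalBlock(nn.Module):
    """Convolutional block standing in for an early decoder stage.

    The idea mirrored here: early-stage features mostly need local
    aggregation, so a depthwise + pointwise convolution pair can replace
    a costly self-attention layer.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return x + self.act(self.norm(self.pw(self.dw(x))))


class PromptQueryGenerator(nn.Module):
    """Generates audio-conditioned queries for cross-attention.

    A learned set of query embeddings is modulated by the audio feature,
    so the decoder is guided toward sound-related regions rather than
    letting Softmax over-concentrate on a few positions.
    """

    def __init__(self, num_queries: int, dim: int, audio_dim: int):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, audio_dim) -> prompt queries: (B, N, dim)
        prompt = self.audio_proj(audio_feat).unsqueeze(1)  # (B, 1, dim)
        return self.queries.weight.unsqueeze(0) + prompt   # broadcast add


if __name__ == "__main__":
    b, c, h, w = 2, 256, 28, 28
    visual = torch.randn(b, c, h, w)
    audio = torch.randn(b, 128)

    local_block = ConvLocalBlock(c)
    query_gen = PromptQueryGenerator(num_queries=100, dim=c, audio_dim=128)

    print(local_block(visual).shape)  # torch.Size([2, 256, 28, 28])
    print(query_gen(audio).shape)     # torch.Size([2, 100, 256])
```

The efficiency argument behind such a substitution is that depthwise plus pointwise convolutions scale linearly with the number of spatial positions, whereas self-attention scales quadratically, which is why replacing attention in early, high-resolution decoder stages reduces the computational burden.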

