AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Abstract

Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their high computational cost makes real-time inference impractical. By characterizing the attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, in which Softmax produces over-concentrated attention weights within restricted frames, and 2) an inefficient, burdensome transformer decoder, caused by narrow focus patterns in the early stages. In this paper, we introduce AVESFormer, the first real-time Audio-Visual Efficient Segmentation transformer that is simultaneously fast, efficient, and lightweight. Our model leverages an efficient prompt query generator to correct the behaviour of cross-attention. Additionally, we propose the ELF decoder, which improves efficiency by using convolutions suited to local features, thereby reducing the computational burden. Extensive experiments demonstrate that AVESFormer significantly enhances model performance, achieving 79.9% on S4, 57.9% on MS3, and 31.2% on AVSS, outperforming the previous state of the art and achieving an excellent trade-off between performance and speed. Code can be found at https://github.com/MarkXCloud/AVESFormer.git.
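
As a rough illustration of how Softmax can yield over-concentrated attention weights when only a few keys are available, the sketch below computes scaled dot-product attention weights in plain Python. It is a toy example and not the authors' implementation: the function names, the two-dimensional vectors, and the reading of "restricted frames" as a very small key set are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)


# Hypothetical toy data: two very different queries.
queries = [[1.0, 0.0], [-3.0, 2.0]]

# Assumed degenerate case: a single key (for example, one audio frame).
# Softmax over one score is always 1, so every query gets the same weight
# regardless of its content.
single_key = [[0.5, -1.0]]
for q in queries:
    print(cross_attention_weights(q, single_key))

# With several keys, the weights depend on the query again.
several_keys = [[0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]]
for q in queries:
    print(cross_attention_weights(q, several_keys))
```

With a single key, the normalization forces the weight to 1 for any query, so the attention map carries no query-specific information; with more keys, different queries can attend to different positions.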

