MambaEVT: Event Stream based Visual Object Tracking using State Space Model

Abstract

Event camera-based visual tracking has attracted growing attention in recent years, owing to the unique imaging principle of event cameras and their advantages of low energy consumption, high dynamic range, and dense temporal resolution. Current event-based tracking algorithms are gradually reaching a performance bottleneck because they rely on vision Transformers and a static template for target object localization. In this paper, we propose a novel Mamba-based visual tracking framework that adopts a state space model with linear complexity as its backbone network. The search region and the target template are fed into the vision Mamba network for simultaneous feature extraction and interaction. The output tokens of the search region are then fed into the tracking head for target localization. More importantly, we introduce a dynamic template update strategy into the tracking framework using a Memory Mamba network. By considering the diversity of samples in the target template library and making appropriate adjustments to the template memory module, a more effective dynamic template can be integrated. The effective combination of dynamic and static templates allows our Mamba-based tracking algorithm to achieve a good balance between accuracy and computational cost on multiple large-scale datasets, including EventVOT, VisEvent, and FE240hz. The source code will be released at https://github.com/Event-AHU/MambaEVT
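To make the idea of joint template and search processing with a linear-complexity state space model more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain diagonal linear recurrence with fixed parameters `a`, `b`, `c` and scalar tokens, whereas Mamba uses input-dependent (selective) parameters and multi-channel features. The function names `ssm_scan` and `track_features` are hypothetical and chosen only for illustration.

```python
from typing import List


def ssm_scan(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Run a diagonal linear state space recurrence over a 1-D token sequence.

    h_t = a * h_{t-1} + b * x_t   (element-wise over the state dimensions)
    y_t = sum(c * h_t)

    The loop visits each token once, so the cost grows linearly with the
    sequence length (O(L * N) for L tokens and N state dimensions).
    """
    h = [0.0] * len(a)  # hidden state, initialised to zero
    ys = []
    for x_t in x:
        h = [a_n * h_n + b_n * x_t for a_n, h_n, b_n in zip(a, h, b)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return ys


def track_features(template: List[float], search: List[float],
                   a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Scan template tokens followed by search tokens; keep only search outputs.

    Because the template is scanned first, the hidden state carries template
    information into every search-token output (assumed token ordering).
    """
    ys = ssm_scan(template + search, a, b, c)
    return ys[len(template):]


if __name__ == "__main__":
    a, b, c = [0.5], [1.0], [1.0]
    # Search outputs conditioned on the template tokens.
    print(track_features([1.0, 1.0], [0.0, 2.0], a, b, c))  # [0.75, 2.375]
    # The same search tokens without a template give different outputs.
    print(track_features([], [0.0, 2.0], a, b, c))          # [0.0, 2.0]
```

In this toy setting the search outputs change when template tokens precede them, which is the sense in which feature extraction and template-search interaction happen in a single pass. In a tracker, outputs like these would then go to a tracking head for localization.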
