Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

Abstract

Single-stream architectures using Vision Transformer (ViT) backbones have recently shown great potential for real-time UAV tracking. However, frequent occlusions from obstacles such as buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing invariance of the target's feature representation with respect to random masking operations modeled by a spatial Cox process. This random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Code is available at https://github.com/wuyou3474/ORTrack.
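To make the masking idea concrete, the following is a minimal NumPy sketch of invariance to Cox-process masking. It is not the ORTrack implementation: the abstract does not specify the intensity model, how masked patches are filled, or the exact loss, so everything here is an assumption for illustration. In particular, the log-Gaussian intensity field (sampled independently per patch, with no spatial correlation), the zero-fill masking, the mean-squared-error loss, and the names `cox_patch_mask`, `mask_tokens`, `toy_encoder`, and `invariance_loss` are all hypothetical.

```python
import numpy as np


def cox_patch_mask(grid_h, grid_w, mean_intensity=0.3, sigma=1.0, rng=None):
    """Sample a boolean patch mask from a discretized Cox process (assumed model).

    A Cox process is a Poisson process whose intensity is itself random.
    Here the intensity is log-Gaussian and drawn independently per patch.
    True means the patch is masked, i.e. treated as occluded.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: draw the random intensity field (the "doubly stochastic" part).
    log_field = rng.normal(0.0, sigma, size=(grid_h, grid_w))
    # Subtracting sigma^2 / 2 makes the expected intensity equal mean_intensity.
    intensity = mean_intensity * np.exp(log_field - 0.5 * sigma**2)
    # Step 2: given the intensity, point counts per patch are Poisson.
    counts = rng.poisson(intensity)
    # A patch is masked if at least one point falls inside it.
    return counts > 0


def mask_tokens(tokens, mask):
    """Zero out masked patch tokens. tokens: (N, D); mask: (H, W) with H * W == N."""
    out = tokens.copy()
    out[mask.reshape(-1)] = 0.0
    return out


def toy_encoder(tokens, weight):
    """Stand-in for a ViT backbone: one linear map followed by tanh."""
    return np.tanh(tokens @ weight)


def invariance_loss(feat_clean, feat_masked):
    """Mean squared difference between clean and masked feature maps."""
    return float(np.mean((feat_clean - feat_masked) ** 2))


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8 * 8, 16))        # 8x8 patch grid, 16-dim tokens
weight = rng.normal(size=(16, 16)) / 4.0     # fixed toy encoder weights
mask = cox_patch_mask(8, 8, rng=rng)
masked_tokens = mask_tokens(tokens, mask)
loss = invariance_loss(toy_encoder(tokens, weight),
                       toy_encoder(masked_tokens, weight))
print(f"masked patches: {int(mask.sum())} / {mask.size}, invariance loss: {loss:.4f}")
```

In a training loop, a term like this would be minimized alongside the tracking loss so that the backbone produces similar target features with and without the simulated occlusion. A spatially correlated intensity field would yield clustered masks that resemble real occluders more closely than the independent field used in this sketch.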
