Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

August 25, 2023
Authors: Matthew Dutson, Yin Li, Mohit Gupta
cs.AI

Abstract

Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
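To make the core idea concrete, below is a minimal, hypothetical sketch of how a token gate that "identifies and re-processes only those tokens that have changed significantly over time" might be implemented. The class and function names (TokenGate, eventful_block, heavy_fn) and the top-k-by-change-norm policy are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch (not the paper's code): gate tokens by how much they
# changed since the last frame, and run the expensive computation only on
# the selected tokens, reusing cached outputs for the rest.
import torch


class TokenGate:
    """Keeps a reference copy of the tokens and, on each new frame,
    selects the k tokens whose change (L2 norm) is largest."""

    def __init__(self, k: int):
        self.k = k             # number of tokens to re-process per frame
        self.reference = None  # tokens last seen by the downstream block

    def select(self, tokens: torch.Tensor):
        # tokens: (num_tokens, dim) for a single frame
        if self.reference is None:
            # First frame: every token must be processed.
            self.reference = tokens.clone()
            return tokens, torch.arange(tokens.shape[0])

        # Per-token change magnitude relative to the stored reference.
        delta = torch.linalg.norm(tokens - self.reference, dim=-1)
        idx = torch.topk(delta, k=min(self.k, tokens.shape[0])).indices

        # Update only the selected rows of the reference.
        self.reference[idx] = tokens[idx]
        return tokens[idx], idx


def eventful_block(tokens, gate, heavy_fn, output_cache):
    """Run the expensive per-token function only on the gated tokens,
    scattering the results back into a cached full-size output."""
    selected, idx = gate.select(tokens)
    output_cache[idx] = heavy_fn(selected)  # e.g. a Transformer block's MLP
    return output_cache
```

Under this kind of scheme, the parameter k gives the runtime-adjustable knob over compute cost that the abstract mentions: lowering k processes fewer tokens per frame and saves more computation, at the cost of some accuracy.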