Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

Abstract

Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
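
The idea of re-processing only the tokens that have changed significantly can be illustrated with a small sketch. This is not the authors' implementation. It assumes a simple selection policy (pick the `budget` tokens with the largest squared change relative to a stored reference), and the names `TokenGate`, `budget`, `process_frame`, and `per_token_op` are hypothetical, introduced only for illustration. The token-wise operation stands in for any computation applied to each token independently.

```python
from typing import Callable, Dict, List, Sequence

Token = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TokenGate:
    """Stores a reference copy of each token and selects the ones that drifted most.

    Assumption: selection is a fixed per-frame budget, not the paper's exact policy.
    """

    def __init__(self, budget: int) -> None:
        self.budget = budget  # hypothetical knob: max tokens re-processed per frame
        self.reference: List[Token] = []

    def select(self, tokens: List[Token]) -> List[int]:
        if not self.reference:
            # First frame: nothing to compare against, so process every token.
            self.reference = [list(t) for t in tokens]
            return list(range(len(tokens)))
        errors = [squared_distance(t, r) for t, r in zip(tokens, self.reference)]
        ranked = sorted(range(len(tokens)), key=lambda i: errors[i], reverse=True)
        chosen = sorted(i for i in ranked[: self.budget] if errors[i] > 0.0)
        # Only selected tokens refresh their reference, so small changes that
        # are skipped now keep accumulating until they are eventually selected.
        for i in chosen:
            self.reference[i] = list(tokens[i])
        return chosen


def process_frame(
    gate: TokenGate,
    tokens: List[Token],
    cache: Dict[int, Token],
    per_token_op: Callable[[Token], Token],
) -> List[Token]:
    """Re-run per_token_op on selected tokens only; reuse cached outputs elsewhere."""
    for i in gate.select(tokens):
        cache[i] = per_token_op(tokens[i])
    return [cache[i] for i in range(len(tokens))]
```

Under these assumptions, raising or lowering `budget` between frames changes how many tokens are recomputed, which is one simple way to picture adaptive control over compute cost at runtime. Outputs for unselected tokens are slightly stale, which is where the trade-off between cost and accuracy comes from.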
