
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

August 25, 2023
Authors: Matthew Dutson, Yin Li, Mohit Gupta
cs.AI

Abstract

Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
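The core idea described in the abstract is a token gate: cache token embeddings from the previous frame, measure how much each token has changed, and send only the most-changed tokens through the Transformer blocks. The sketch below is a minimal illustration of that idea, not the authors' implementation; the function name `token_gate`, the L2-norm change measure, and the top-k budget `k` are assumptions made for the example.

```python
import torch

def token_gate(tokens: torch.Tensor, cache: torch.Tensor, k: int):
    """Illustrative token gate (not the paper's exact method).

    tokens: (N, D) token embeddings for the current frame
    cache:  (N, D) token embeddings from the last processed frame
    k:      number of tokens to re-process (runtime compute budget)
    Returns indices of the selected tokens and the updated cache.
    """
    # Per-token magnitude of change since the cached frame.
    delta = (tokens - cache).norm(dim=1)
    # Re-process only the k tokens that changed the most.
    idx = delta.topk(k).indices
    # Refresh the cache at the selected positions; the remaining tokens
    # keep their previously computed values and skip recomputation.
    new_cache = cache.clone()
    new_cache[idx] = tokens[idx]
    return idx, new_cache

# Example: 196 tokens of dimension 768, budget of 49 tokens (~4x savings).
prev = torch.randn(196, 768)
curr = prev + 0.01 * torch.randn(196, 768)  # mostly-static frame
selected, prev = token_gate(curr, prev, k=49)
```

Because `k` is a runtime parameter, this kind of gating gives adaptive control over compute cost per frame, which matches the adaptive-cost property the abstract describes.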