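Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

Abstract

Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.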
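The core idea of re-processing only the tokens that changed significantly can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_changed_tokens`, `update_tokens`), the L2-norm change measure, the fixed threshold, and the toy per-token function `token_fn` are all assumptions chosen for illustration. Tokens are represented as plain lists of floats.

```python
import math

def select_changed_tokens(reference, current, threshold):
    """Return indices of tokens whose L2 distance from the reference exceeds threshold."""
    changed = []
    for i, (ref_tok, cur_tok) in enumerate(zip(reference, current)):
        dist = math.sqrt(sum((c - r) ** 2 for r, c in zip(ref_tok, cur_tok)))
        if dist > threshold:
            changed.append(i)
    return changed

def update_tokens(reference, cached_outputs, current, threshold, token_fn):
    """Recompute token_fn only for changed tokens; reuse cached outputs for the rest.

    `reference` and `cached_outputs` are updated in place (hypothetical design).
    """
    changed = select_changed_tokens(reference, current, threshold)
    for i in changed:
        reference[i] = list(current[i])           # refresh the stored reference token
        cached_outputs[i] = token_fn(current[i])  # re-process only this token
    return changed

# Toy per-token operation standing in for a token-wise Transformer layer (assumption).
def token_fn(tok):
    return [2.0 * x for x in tok]

# First frame: process every token and store references.
frame0 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
reference = [list(t) for t in frame0]
cache = [token_fn(t) for t in frame0]

# Second frame: only the last token changes by more than the threshold.
frame1 = [[0.0, 0.0], [1.0, 1.05], [5.0, 2.0]]
changed = update_tokens(reference, cache, frame1, threshold=0.5, token_fn=token_fn)
```

In this toy run only one of three tokens is re-processed on the second frame. Raising or lowering `threshold` at runtime trades accuracy for compute, which mirrors the adaptive cost control described in the abstract.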

