Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Abstract

In this paper, we address the substantial training time and memory consumption of video transformers, focusing on the ViViT (Video Vision Transformer) model, and in particular its Factorised Encoder version, as our baseline for action recognition tasks. The Factorised Encoder variant follows the late-fusion approach adopted by many state-of-the-art methods. Although it stands out among the ViViT variants for its favorable speed/accuracy tradeoff, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this leads to a low-accuracy model. However, we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) and (2) introducing a compact adapter model that connects the frozen spatial representations (from a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by approximately 50%) and memory consumption while maintaining or slightly improving performance, by up to 1.79% compared to the baseline model. Our approach additionally unlocks the capability to use larger image transformer models as the spatial transformer and to process more frames with the same memory consumption.
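
To give a rough feel for why freezing the spatial transformer can lighten training, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It counts parameters for a factorised encoder and reports what fraction would still need gradients and optimizer state once the spatial transformer is frozen. All sizes here are assumptions chosen for illustration (hidden width 768, 12 spatial layers, 4 temporal layers, and a hypothetical two-layer bottleneck adapter of width 64); patch embeddings, positional embeddings, and the classification head are ignored. The resulting fraction is not the training-cost reduction reported in the abstract, which also depends on activations and compute.

```python
from dataclasses import dataclass


def encoder_layer_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Parameter count of one standard transformer encoder layer."""
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * (d_model * d_model + d_model)
    hidden = mlp_ratio * d_model
    # Two linear layers of the MLP block, with biases.
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    return attention + mlp + layer_norms


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool = True


def trainable_fraction(modules) -> float:
    """Share of parameters that still receive gradient updates."""
    total = sum(m.num_params for m in modules)
    trainable = sum(m.num_params for m in modules if m.trainable)
    return trainable / total


# Illustrative sizes only (assumptions, not taken from the abstract).
D = 768
SPATIAL_LAYERS = 12
TEMPORAL_LAYERS = 4
BOTTLENECK = 64  # hypothetical adapter width

spatial = Module("spatial_transformer", SPATIAL_LAYERS * encoder_layer_params(D))
temporal = Module("temporal_transformer", TEMPORAL_LAYERS * encoder_layer_params(D))
# Hypothetical adapter: down-projection followed by up-projection, with biases.
adapter = Module("adapter", (D * BOTTLENECK + BOTTLENECK) + (BOTTLENECK * D + D))

# Baseline: everything is trained end to end.
baseline = [spatial, temporal]
# Proposed setup: spatial transformer frozen, adapter and temporal transformer trained.
frozen = [Module(spatial.name, spatial.num_params, trainable=False), adapter, temporal]

print(f"baseline trainable fraction:       {trainable_fraction(baseline):.2f}")
print(f"frozen-spatial trainable fraction: {trainable_fraction(frozen):.2f}")
```

Under these assumed sizes the adapter adds very few parameters relative to either transformer, which is consistent with the abstract's description of it as compact.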
