Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

June 7, 2023
Authors: Shreyank N Gowda, Anurag Arnab, Jonathan Huang
cs.AI

Abstract

In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The Factorised Encoder variant follows the late-fusion approach adopted by many state-of-the-art methods. Despite standing out for its favorable speed/accuracy tradeoff among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this leads to a low-accuracy model. But we show that by (1) appropriately initializing the temporal transformer (the module responsible for processing temporal information) and (2) introducing a compact adapter model connecting the frozen spatial representations (produced by the module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by ~50%) and memory consumption while maintaining or slightly improving performance by up to 1.79% compared to the baseline model. Our approach additionally unlocks the capability to utilize larger image transformer models as our spatial transformer and to access more frames with the same memory consumption.
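For intuition, the training setup described in the abstract (a frozen per-frame spatial transformer, a compact adapter, and a trainable temporal transformer in a late-fusion layout) can be sketched roughly as below in PyTorch. This is only an illustrative sketch under assumptions, not the authors' implementation: the adapter bottleneck, dimensions, pooling, and classifier head are placeholders, and the paper's temporal-transformer initialization step is omitted.

```python
# Illustrative sketch of a frozen-spatial-transformer training setup (not the paper's code).
# Assumptions: `spatial_encoder` maps a batch of frames (N, C, H, W) to per-frame
# embeddings (N, dim); adapter size, head, and hyperparameters are placeholders.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Compact bottleneck adapter bridging frozen spatial features to the temporal transformer."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual adapter: only these small layers (plus the temporal model) are trained.
        return x + self.up(self.act(self.down(x)))


class FactorisedEncoderSketch(nn.Module):
    """Late fusion: frozen per-frame spatial encoder -> adapter -> temporal transformer -> head."""
    def __init__(self, spatial_encoder: nn.Module, dim: int = 768,
                 temporal_layers: int = 4, num_classes: int = 400):
        super().__init__()
        self.spatial = spatial_encoder
        for p in self.spatial.parameters():   # freeze the spatial transformer
            p.requires_grad = False
        self.adapter = Adapter(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=temporal_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                 # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        with torch.no_grad():                 # frozen encoder: no gradients/activations kept
            feats = self.spatial(video.flatten(0, 1))   # (B*T, dim) per-frame embeddings
        feats = self.adapter(feats).reshape(b, t, -1)   # (B, T, dim)
        feats = self.temporal(feats)                    # fuse information across frames
        return self.head(feats.mean(dim=1))             # pool over time, then classify
```

In this sketch, any image transformer that returns pooled per-frame features could serve as the frozen spatial encoder (for example, a pretrained ViT used as a feature extractor); since its weights are frozen and its forward pass runs without gradients, training memory and compute are dominated by the small adapter and the temporal transformer, which is the effect the abstract describes.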