Optimizing ViViT Training: Time and Memory Reduction for Action Recognition
June 7, 2023
Authors: Shreyank N Gowda, Anurag Arnab, Jonathan Huang
cs.AI
Abstract
In this paper, we address the challenges posed by the substantial training
time and memory consumption associated with video transformers, focusing on the
ViViT (Video Vision Transformer) model, in particular the Factorised Encoder
version, as our baseline for action recognition tasks. The Factorised Encoder
variant follows the late-fusion approach adopted by many state-of-the-art
methods. Despite standing out for its favorable speed/accuracy tradeoffs
among the different variants of ViViT, its considerable training time and
memory requirements still pose a significant barrier to entry. Our method is
designed to lower this barrier and is based on the idea of freezing the spatial
transformer during training. Done naively, this leads to a low-accuracy model,
but we show that by (1) appropriately initializing the temporal transformer (a
module responsible for processing temporal information) and (2) introducing a
compact adapter model connecting the frozen spatial representations (from a
module that selectively focuses on regions of the input image) to the temporal
transformer, we can enjoy the benefits of freezing the spatial transformer
without sacrificing accuracy. Through extensive experimentation
over 6 benchmarks, we demonstrate that our proposed training strategy
significantly reduces training costs (by ~50%) and memory consumption
while maintaining or slightly improving performance (by up to 1.79%) compared to
the baseline model. Our approach additionally unlocks the capability to utilize
larger image transformer models as our spatial transformer and access more
frames with the same memory consumption.
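To make the described training strategy concrete, below is a minimal PyTorch-style sketch, assuming a factorised-encoder setup: a pretrained spatial (image) transformer is frozen, its per-frame features pass through a small trainable adapter, and only the adapter, the temporal transformer, and the classification head receive gradients. All class and parameter names here (Adapter, FrozenFactorisedEncoder, bottleneck, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Compact residual bottleneck adapter (hypothetical sketch) that maps
    frozen spatial features into the temporal transformer's input space."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: only these few parameters are trained.
        return x + self.up(self.act(self.down(x)))


class FrozenFactorisedEncoder(nn.Module):
    """Factorised-encoder-style video classifier with a frozen spatial
    transformer, a trainable adapter, and a trainable temporal transformer."""

    def __init__(self, spatial_transformer: nn.Module,
                 temporal_transformer: nn.Module,
                 dim: int, num_classes: int):
        super().__init__()
        self.spatial = spatial_transformer           # pretrained image transformer
        for p in self.spatial.parameters():          # freeze: no gradients, no optimizer state
            p.requires_grad = False
        self.adapter = Adapter(dim)
        self.temporal = temporal_transformer         # e.g. initialized from pretrained weights
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        with torch.no_grad():                        # frozen spatial features need no backprop
            frames = video.flatten(0, 1)             # (B*T, C, H, W)
            feats = self.spatial(frames)             # (B*T, dim) per-frame representation
        feats = self.adapter(feats).view(b, t, -1)   # (B, T, dim)
        feats = self.temporal(feats)                 # (B, T, dim) temporal mixing
        return self.head(feats.mean(dim=1))          # pool over time, then classify
```

Because the spatial transformer is wrapped in torch.no_grad() and excluded from the optimizer, its activations need not be stored for the backward pass, which is what enables the reported training-time and memory savings and makes room for larger image backbones or more frames at the same memory budget.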