No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding
May 14, 2024
Authors: Yingjie Zhai, Wenshuo Li, Yehui Tang, Xinghao Chen, Yunhe Wang
cs.AI
Abstract
Current architectures for video understanding mainly build upon 3D
convolutional blocks or 2D convolutions with additional operations for temporal
modeling. However, these methods all regard the temporal axis as a separate
dimension of the video sequence, which requires large computation and memory
budgets and thus limits their usage on mobile devices. In this paper, we
propose to squeeze the time axis of a video sequence into the channel dimension
and present a lightweight video recognition network, termed
SqueezeTime, for mobile video understanding. To enhance the temporal
modeling capability of the proposed network, we design a Channel-Time Learning
(CTL) Block to capture temporal dynamics of the sequence. This module has two
complementary branches: one learns the temporal importance of features, while
the other restores temporal position information to enhance inter-temporal
object modeling. The proposed SqueezeTime is much more lightweight and faster,
with high accuracy, for mobile video understanding.
Extensive experiments on various video recognition and action detection
benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14,
demonstrate the superiority of our model. For example, our SqueezeTime achieves
+1.2% accuracy and an +80% GPU throughput gain on Kinetics400 over prior
methods. Code is publicly available at
https://github.com/xinghaochen/SqueezeTime and
https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SqueezeTime.
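To make the central idea concrete, below is a minimal PyTorch sketch of folding the temporal axis into the channel dimension so a whole clip can be processed with plain 2D convolutions. The `CTLBlock` is a hypothetical reconstruction of the two-branch design described in the abstract (a squeeze-and-excitation-style gate standing in for temporal importance learning, and a depthwise 3x3 branch standing in for temporal position restoring); the class names and layer choices are illustrative assumptions, not the authors' reference implementation, which lives in the repositories linked above.

```python
# Sketch only: folds (B, C, T, H, W) into (B, C*T, H, W) and applies a
# hypothetical two-branch Channel-Time Learning block. Not the official code.
import torch
import torch.nn as nn


class CTLBlock(nn.Module):
    """Hypothetical Channel-Time Learning block with two complementary branches."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Branch 1 (temporal importance learning, assumed form): a channel
        # gate over the folded C*T axis, squeeze-and-excitation style.
        self.importance = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Branch 2 (temporal position restoring, assumed form): a depthwise +
        # pointwise 2D convolution meant to recover inter-temporal object cues
        # that the reshape intermixed across channels.
        self.restore = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the restored features by learned importance, with a residual.
        return self.restore(x) * self.importance(x) + x


class SqueezeTimeStem(nn.Module):
    """Squeeze time into channels, then treat the clip as one 2D image."""

    def __init__(self, in_channels: int = 3, frames: int = 16, width: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(in_channels * frames, width, kernel_size=3,
                              stride=2, padding=1)
        self.ctl = CTLBlock(width)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = video.shape
        x = video.reshape(b, c * t, h, w)  # the time-to-channel squeeze
        return self.ctl(self.proj(x))


if __name__ == "__main__":
    clip = torch.randn(2, 3, 16, 112, 112)  # (B, C, T, H, W)
    print(SqueezeTimeStem()(clip).shape)    # torch.Size([2, 64, 56, 56])
```

The payoff of the reshape is that the entire clip becomes a single image with C*T input channels, so every subsequent layer inherits the memory and latency profile of a 2D CNN rather than a 3D one, which is what makes the design attractive for mobile devices.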