No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Abstract

Current architectures for video understanding are mainly built on 3D convolutional blocks or on 2D convolutions with additional operations for temporal modeling. However, these methods all treat the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their use on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture the temporal dynamics of the sequence. This module has two complementary branches: one branch performs temporal importance learning, and the other, which has temporal position restoring capability, enhances inter-temporal object modeling. The proposed SqueezeTime is lightweight and fast while remaining highly accurate for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, namely Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14, demonstrate the superiority of our model. For example, SqueezeTime improves accuracy by 1.2% and GPU throughput by 80% on Kinetics400 compared with prior methods. Code is publicly available at https://github.com/xinghaochen/SqueezeTime and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SqueezeTime.
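As a rough illustration of the central idea, squeezing the time axis into the channel dimension, the sketch below folds a toy clip of shape (C, T, H, W) into (C·T, H, W) so that it could be processed by ordinary 2D operations. It is not the authors' implementation: the helper names `squeeze_time` and `shape`, the nested-list representation, and the channel-major ordering (index `c * T + t`) are all assumptions made for this example, and the CTL block is not modeled.

```python
def squeeze_time(video):
    """Fold the time axis into the channel axis.

    `video` is a nested list indexed as video[c][t][h][w] (channels, frames,
    height, width). The result is indexed as out[c * T + t][h][w], so every
    frame of every channel becomes its own 2D channel.
    The channel-major ordering is an assumption made for this sketch.
    """
    return [frame for channel in video for frame in channel]


def shape(nested):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy clip: 3 colour channels, 4 frames, 2x2 pixels. Each pixel stores its own
# (c, t) coordinates so the index mapping can be checked directly.
C, T, H, W = 3, 4, 2, 2
clip = [
    [[[(c, t) for _ in range(W)] for _ in range(H)] for t in range(T)]
    for c in range(C)
]

squeezed = squeeze_time(clip)
print(shape(clip))      # (3, 4, 2, 2)
print(shape(squeezed))  # (12, 2, 2)
```

After the squeeze there is no separate time dimension left to convolve over, which is where the abstract's computation and memory savings come from; the temporal order survives only as the position of each frame within the channel axis.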
