Make Your Training Flexible: Towards Deployment-Efficient Video Models
March 18, 2025
Authors: Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
cs.AI
Abstract
Popular video training methods mainly operate on a fixed number of tokens
sampled from a predetermined spatiotemporal grid, resulting in sub-optimal
accuracy-computation trade-offs due to inherent video redundancy. They also
lack adaptability to varying computational budgets for downstream tasks,
hindering the application of the most competitive models in real-world
scenarios. We thus propose a new test setting, Token Optimization, which
maximizes input information across budgets by optimizing the size-limited set
of input tokens through token selection from more suitably sampled videos. To
this end, we introduce a novel augmentation tool termed Flux. By making the
sampling grid flexible and leveraging token selection, Flux is easily adopted
in most popular video training frameworks, boosting model robustness at nearly
no additional cost. We integrate Flux into large-scale video pre-training, and
the resulting FluxViT establishes new state-of-the-art results across
extensive tasks at standard costs. Notably, with only 1/4 of the tokens, it
can still match the performance of previous state-of-the-art models using
Token Optimization, yielding nearly 90% savings. All models and data are
available at https://github.com/OpenGVLab/FluxViT.
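The core idea of Token Optimization can be illustrated with a minimal sketch: oversample a video on a denser spatiotemporal grid than the deployment budget allows, score each token's importance, and keep only the top-budget tokens. This is a hypothetical simplification, not the paper's actual selector; the `sample_grid` and `token_optimization` helpers and the random importance scores are illustrative stand-ins for whatever learned or heuristic scoring FluxViT uses.

```python
# Hypothetical sketch of Token Optimization under a fixed token budget.
# Importance scores are random stand-ins for a learned/heuristic measure.
import random

def sample_grid(num_frames, height, width):
    """Enumerate (t, h, w) token positions on a spatiotemporal grid."""
    return [(t, h, w)
            for t in range(num_frames)
            for h in range(height)
            for w in range(width)]

def token_optimization(tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens from an oversampled grid."""
    ranked = sorted(zip(scores, tokens), key=lambda p: p[0], reverse=True)
    return [tok for _, tok in ranked[:budget]]

random.seed(0)
# Oversample: 16 frames at a 4x4 patch grid -> 256 candidate tokens ...
grid = sample_grid(16, 4, 4)
scores = [random.random() for _ in grid]
# ... but the deployment budget only allows 1/4 of them.
budget = 64
kept = token_optimization(grid, scores, budget)
print(len(kept))  # 64
```

In this view, the flexible sampling grid (Flux) decides which candidate tokens exist, and Token Optimization decides which of them fit the budget, so the same trained model can serve different compute budgets at test time.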