PEEK: Picking Essential frames via Efficient Knowledge distillation

May 29, 2026
Authors: Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen
cs.AI

Abstract

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision-language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
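
To make the contrast between uniform sampling and score-based frame selection concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`uniform_indices`, `topk_indices`, `listwise_distill_loss`) are hypothetical, and the ListNet-style softmax cross-entropy is only one plausible way to distill a teacher's frame ranking into a student's scores. The abstract does not state which loss PEEK uses.

```python
import math
from typing import List, Sequence


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Content-agnostic baseline: take the center frame of k equal segments."""
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def topk_indices(scores: Sequence[float], k: int) -> List[int]:
    """Select the k highest-scoring frames and return them in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_distill_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """ASSUMED loss (ListNet-style): cross-entropy between the teacher's and
    the student's softmax distributions over frames. It is smallest when the
    student reproduces the teacher's distribution."""
    p = softmax(teacher)
    q = softmax(student)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Hypothetical per-frame relevance scores for a 4-frame clip.
teacher_scores = [0.1, 0.9, 0.3, 0.8]   # e.g. from a caption-conditioned teacher
student_scores = [0.2, 0.7, 0.4, 0.9]   # e.g. from a vision-only student

print(uniform_indices(8, 2))            # frames chosen without looking at content
print(topk_indices(student_scores, 2))  # frames chosen from predicted relevance
print(listwise_distill_loss(student_scores, teacher_scores))
```

In this sketch the teacher's scores are needed only to compute the training loss. At inference time, selection relies on the student's scores alone, which matches the abstract's description of a lightweight model that operates only on visual content.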