PEEK: Picking Essential frames via Efficient Knowledge distillation
May 29, 2026
Authors: Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen
cs.AI
Abstract
Video-language models can process only a limited number of frames, which makes frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but ignores visual content. Adaptive frame sampling has recently emerged as a promising way to select the most informative frames from a video, but existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. On ActivityNet Captions and MSR-VTT, our method overall outperforms state-of-the-art methods across all evaluated downstream vision-language models, especially when only one or two frames are selected for captioning, and obtains the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
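The contrast the abstract draws between uniform sampling and score-based frame selection can be illustrated with a minimal sketch. This is not the released PEEK implementation: the function names `uniform_indices` and `top_k_indices` are hypothetical, `scores` stands in for whatever per-frame relevance the lightweight temporal model produces, and plain top-k selection is an assumption about how scores are turned into a frame set.

```python
from typing import List, Sequence


def uniform_indices(num_frames: int, budget: int) -> List[int]:
    """Content-agnostic baseline: take the center frame of each of
    `budget` equal-length temporal segments."""
    if budget <= 0 or num_frames <= 0:
        return []
    budget = min(budget, num_frames)
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def top_k_indices(scores: Sequence[float], budget: int) -> List[int]:
    """Score-based selection (assumed, not taken from the paper): keep the
    `budget` highest-scoring frames and return them in temporal order."""
    if budget <= 0:
        return []
    # Rank frame indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so the captioner sees frames chronologically.
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Hypothetical relevance scores for an 8-frame clip.
    scores = [0.05, 0.10, 0.70, 0.20, 0.15, 0.90, 0.30, 0.10]
    print("uniform:", uniform_indices(len(scores), 2))
    print("top-k:  ", top_k_indices(scores, 2))
```

With a budget of one or two frames, the uniform baseline picks fixed temporal positions regardless of content, while the score-based variant follows the scorer. The selection step itself is trivial, so any extra cost in such a pipeline comes from the model that computes the scores.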