PEEK: Selecting Essential Frames via Efficient Knowledge Distillation

Abstract

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. On ActivityNet Captions and MSR-VTT, our method overall outperforms state-of-the-art methods across all evaluated downstream vision-language models and obtains the best CIDEr for most frame budgets, especially when only one or two frames are selected for captioning. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed, as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
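To make the contrast between uniform and content-aware sampling concrete, the following is a minimal sketch, not the PEEK implementation. It assumes that some scoring model has already produced one relevance score per frame; the function names and the example scores are hypothetical and chosen only for illustration.

```python
def uniform_indices(num_frames: int, budget: int) -> list[int]:
    """Content-agnostic baseline: one index from the center of each
    of `budget` equal-length segments of the video."""
    if budget <= 0 or num_frames <= 0:
        return []
    budget = min(budget, num_frames)
    step = num_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def top_k_indices(scores: list[float], budget: int) -> list[int]:
    """Content-aware selection: keep the `budget` highest-scoring frames.

    `scores` is assumed to hold one relevance score per frame (hypothetical
    output of a frame-scoring model). Indices are returned in temporal
    order so the captioner sees the frames in sequence.
    """
    budget = max(0, min(budget, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Hypothetical per-frame relevance scores for an 8-frame clip.
    scores = [0.10, 0.05, 0.20, 0.90, 0.85, 0.15, 0.30, 0.10]
    print(uniform_indices(len(scores), 2))
    print(top_k_indices(scores, 2))
```

With a budget of two frames, the uniform baseline picks positions that depend only on the clip length, whereas the score-based selection follows whichever frames the scorer rates highest. A plain top-k rule ignores temporal coverage and visual diversity, which the abstract notes become more competitive at larger frame budgets.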