PEEK: Key Frame Selection via Efficient Knowledge Distillation

Abstract

Video-language models can process only a limited number of frames, which makes frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising way to select the most informative frames from a video, but existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates on visual content alone. Overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision-language models and obtains the best CIDEr for most frame budgets, especially when only one or two frames are selected for captioning. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed, as temporal coverage and visual diversity become increasingly competitive selection criteria. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
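
To make the idea of distilling teacher rankings into a visual-only scorer more concrete, the following minimal sketch contrasts uniform sampling with score-based top-k selection and shows one possible listwise distillation objective. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify the loss, so the softmax-based KL divergence here is an assumed stand-in, and all function names, the `temperature` parameter, and the example scores are hypothetical.

```python
import math


def uniform_indices(num_frames, budget):
    """Content-agnostic baseline: take the center frame of `budget` equal segments."""
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def top_k_indices(scores, budget):
    """Select the `budget` highest-scoring frames and return them in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over per-frame scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def listwise_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) between per-frame score distributions.

    Assumption: a ListNet-style listwise objective; the actual PEEK loss
    may differ.
    """
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical caption-conditioned relevance scores from a teacher, one per frame.
teacher = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6]
# Hypothetical scores from a visual-only student for the same frames.
student = [0.2, 0.7, 0.3, 0.9, 0.1, 0.1, 0.3, 0.5]

print(uniform_indices(len(teacher), 2))
print(top_k_indices(student, 2))
print(listwise_distillation_loss(teacher, student))
```

At inference time only the student scores are needed, so frame selection requires no caption; the teacher scores appear only in the training loss. Returning the selected indices in temporal order keeps the frames in their original sequence for the downstream captioner.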