Streaming Dense Video Captioning

Abstract

An ideal model for dense video captioning (predicting captions that are temporally localized in a video) should handle long input videos, predict rich, detailed textual descriptions, and produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components. First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos because the memory has a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state of the art on three dense video captioning benchmarks: ActivityNet, YouCook2, and ViTT. Our code is released at https://github.com/google-research/scenic.
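
To make the fixed-size memory idea concrete, the following is a minimal sketch and not the released implementation. It assumes a weighted K-means-style update: each memory slot is a cluster centre with a count of the tokens it has absorbed, and each new batch of frame tokens is folded into the existing centres. The function name `update_memory`, its parameters, and the initialisation from the first `capacity` points are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_memory(
    memory: List[Vector],
    counts: List[float],
    new_tokens: List[Vector],
    capacity: int,
    num_iters: int = 2,
) -> Tuple[List[Vector], List[float]]:
    """Fold new_tokens into a memory of at most `capacity` cluster centres.

    Hypothetical sketch: `counts[k]` is how many original tokens centre k
    summarises, so older, heavier centres move less when new tokens arrive.
    """
    points = memory + new_tokens
    weights = counts + [1.0] * len(new_tokens)
    if len(points) <= capacity:
        # Memory is not full yet: store the tokens as they are.
        return [list(p) for p in points], list(weights)

    # Assumption: initialise centres from the first `capacity` points,
    # which places existing memory entries first.
    centres = [list(p) for p in points[:capacity]]
    dim = len(points[0])
    totals = [0.0] * capacity
    for _ in range(num_iters):
        sums = [[0.0] * dim for _ in range(capacity)]
        totals = [0.0] * capacity
        for p, w in zip(points, weights):
            # Assign each point to its nearest centre.
            k = min(range(capacity), key=lambda i: _sq_dist(p, centres[i]))
            totals[k] += w
            for d, x in enumerate(p):
                sums[k][d] += w * x
        # Weighted mean per cluster; an empty cluster keeps its old centre.
        centres = [
            [s / totals[k] for s in sums[k]] if totals[k] > 0 else centres[k]
            for k in range(capacity)
        ]
    return centres, totals
```

Calling `update_memory` once per incoming frame keeps the number of stored vectors at or below `capacity` regardless of video length, while the counts always sum to the number of tokens seen so far.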
