ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting

Abstract

We introduce ViSMaP: Unsupervised Video Summarisation by Meta-Prompting, a system that summarises hour-long videos without supervision. Most existing video understanding models work well on short videos of pre-segmented events, yet they struggle to summarise longer videos in which relevant events are sparsely distributed and not pre-segmented. Moreover, long-form video understanding often relies on supervised hierarchical training that requires extensive annotations, which are costly, slow and prone to inconsistency. With ViSMaP we bridge the gap between short videos (where annotated data is plentiful) and long ones (where it is not). We rely on LLMs to create optimised pseudo-summaries of long videos using segment descriptions from short ones. These pseudo-summaries are used as training data for a model that generates long-form video summaries, bypassing the need for expensive annotations of long videos. Specifically, we adopt a meta-prompting strategy to iteratively generate and refine pseudo-summaries of long videos. The strategy leverages short clip descriptions obtained from a supervised short-video model to guide the summary. Each iteration uses three LLMs working in sequence: one to generate the pseudo-summary from clip descriptions, another to evaluate it, and a third to optimise the prompt of the generator. This iteration is necessary because the quality of the pseudo-summaries is highly dependent on the generator prompt and varies widely among videos. We evaluate our summaries extensively on multiple datasets; our results show that ViSMaP achieves performance comparable to fully supervised state-of-the-art models while generalising across domains without sacrificing performance. Code will be released upon publication.
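
The generate, evaluate, optimise cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the prompts, the scoring scale, or the stopping rule, so the function `meta_prompt`, its parameters (`max_iters`, `target_score`), the choice to keep the best-scoring summary, and the three toy stand-ins for the LLMs are all assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

# Assumption: each of the three LLMs is modelled as a plain callable.
Generator = Callable[[str, List[str]], str]    # (prompt, clip descriptions) -> pseudo-summary
Evaluator = Callable[[str, List[str]], float]  # (summary, clip descriptions) -> score
Optimiser = Callable[[str, str, float], str]   # (prompt, summary, score) -> revised prompt


def meta_prompt(
    clips: List[str],
    generate: Generator,
    evaluate: Evaluator,
    optimise: Optimiser,
    initial_prompt: str,
    max_iters: int = 5,         # hypothetical iteration budget
    target_score: float = 0.9,  # hypothetical early-stopping threshold
) -> Tuple[str, float]:
    """Iteratively refine the generator prompt and return the best pseudo-summary."""
    prompt = initial_prompt
    best_summary, best_score = "", float("-inf")
    for _ in range(max_iters):
        summary = generate(prompt, clips)   # LLM 1: write a pseudo-summary
        score = evaluate(summary, clips)    # LLM 2: judge the pseudo-summary
        if score > best_score:
            best_summary, best_score = summary, score
        if score >= target_score:
            break
        prompt = optimise(prompt, summary, score)  # LLM 3: rewrite the generator prompt
    return best_summary, best_score


# Toy stand-ins so the sketch runs without any model (illustrative only).
def toy_generate(prompt: str, clips: List[str]) -> str:
    # The more often the prompt asks for detail, the more clips are covered.
    k = 1 + prompt.count("[more detail]")
    return " ".join(clips[:k])


def toy_evaluate(summary: str, clips: List[str]) -> float:
    # Fraction of clip descriptions that appear verbatim in the summary.
    return sum(c in summary for c in clips) / len(clips)


def toy_optimise(prompt: str, summary: str, score: float) -> str:
    # Ask for more detail whenever the evaluator was not satisfied.
    return prompt + " [more detail]"


clips = [
    "A person chops vegetables.",
    "They heat a pan.",
    "They plate the meal.",
]
summary, score = meta_prompt(
    clips, toy_generate, toy_evaluate, toy_optimise, "Summarise the video."
)
print(score, summary)
```

With these toy stand-ins the loop stops on the third iteration, once the summary covers all three clip descriptions. In a real pipeline the three callables would wrap LLM calls, and the resulting pseudo-summaries would serve as training targets for the long-form summarisation model.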
