ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting

Abstract

We introduce ViSMaP: Unsupervised Video Summarisation by Meta-Prompting, a system that summarises hour-long videos without supervision. Most existing video understanding models work well on short videos of pre-segmented events, yet they struggle to summarise longer videos in which relevant events are sparsely distributed and not pre-segmented. Moreover, long-form video understanding often relies on supervised hierarchical training that needs extensive annotations, which are costly, slow and prone to inconsistency. With ViSMaP we bridge the gap between short videos (where annotated data is plentiful) and long ones (where it is not). We rely on LLMs to create optimised pseudo-summaries of long videos using segment descriptions obtained from short ones. These pseudo-summaries are used as training data for a model that generates long-form video summaries, bypassing the need for expensive annotations of long videos. Specifically, we adopt a meta-prompting strategy to iteratively generate and refine pseudo-summaries of long videos. The strategy leverages short clip descriptions obtained from a supervised short-video model to guide the summary. Each iteration uses three LLMs working in sequence: one to generate the pseudo-summary from clip descriptions, another to evaluate it, and a third to optimise the prompt of the generator. This iteration is necessary because the quality of the pseudo-summaries depends heavily on the generator prompt and varies widely among videos. We evaluate our summaries extensively on multiple datasets; our results show that ViSMaP achieves performance comparable to fully supervised state-of-the-art models while generalising across domains without sacrificing performance. Code will be released upon publication.
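
The generate, evaluate, optimise loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three LLMs are replaced by plain Python callables, and the function names, the numeric score, the fixed iteration count and the keep-the-best rule are all assumptions made only so that the example runs without any model.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the three LLM roles (names are not from the paper).
Generator = Callable[[str, List[str]], str]    # (prompt, clip descriptions) -> pseudo-summary
Evaluator = Callable[[str, List[str]], float]  # (summary, clip descriptions) -> score
Optimiser = Callable[[str, str, float], str]   # (prompt, summary, score) -> revised prompt


def meta_prompt(
    clips: List[str],
    generate: Generator,
    evaluate: Evaluator,
    optimise: Optimiser,
    initial_prompt: str,
    iterations: int = 3,  # assumed budget; the abstract does not state a stopping rule
) -> Tuple[str, str, float]:
    """Return the best (prompt, pseudo-summary, score) seen across iterations."""
    prompt = initial_prompt
    best = (prompt, "", float("-inf"))
    for _ in range(iterations):
        summary = generate(prompt, clips)          # role 1: write a pseudo-summary
        score = evaluate(summary, clips)           # role 2: judge it
        if score > best[2]:
            best = (prompt, summary, score)
        prompt = optimise(prompt, summary, score)  # role 3: revise the generator prompt
    return best


# Toy stand-ins so the loop can be exercised without any LLM.
def toy_generate(prompt: str, clips: List[str]) -> str:
    # Each "+" in the prompt lets the toy generator cover one more clip.
    return " ".join(clips[: prompt.count("+") + 1])


def toy_evaluate(summary: str, clips: List[str]) -> float:
    # Score is the fraction of clip descriptions that appear in the summary.
    return sum(c in summary for c in clips) / len(clips)


def toy_optimise(prompt: str, summary: str, score: float) -> str:
    # Ask for more coverage until every clip is included.
    return prompt if score == 1.0 else prompt + "+"


clips = ["A person chops onions.", "They fry the onions.", "They serve the dish."]
best_prompt, best_summary, best_score = meta_prompt(
    clips, toy_generate, toy_evaluate, toy_optimise, "Summarise", iterations=4
)
```

In this toy run the prompt is revised after every low-scoring summary, which mirrors the abstract's point that summary quality depends strongly on the generator prompt. In a real system each callable would wrap an LLM call, and the resulting pseudo-summaries would then serve as training data for the long-form summarisation model.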
