A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot

Abstract

Multimedia content, such as advertisements and story videos, exhibits a rich blend of creativity and multiple modalities. It incorporates elements like text, visuals, audio, and storytelling techniques, employing devices like emotions, symbolism, and slogans to convey meaning. Previous research in multimedia understanding has focused mainly on videos with specific actions such as cooking, and a dearth of large annotated training datasets has hindered the development of supervised learning models with satisfactory performance for real-world applications. Large language models (LLMs), however, have demonstrated remarkable zero-shot performance in various natural language processing (NLP) tasks, such as emotion classification, question answering, and topic classification. To bridge this performance gap in multimedia understanding, we propose verbalizing story videos to generate their descriptions in natural language and then performing video-understanding tasks on the generated story rather than on the original video. Through extensive experiments on five video-understanding tasks, we demonstrate that our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding. Further, to alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
