A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot

Abstract

Multimedia content, such as advertisements and story videos, exhibits a rich blend of creativity and multiple modalities. It incorporates elements like text, visuals, audio, and storytelling techniques, employing devices like emotions, symbolism, and slogans to convey meaning. Previous research in multimedia understanding has focused mainly on videos with specific actions, like cooking, and there is a dearth of large annotated training datasets, which hinders the development of supervised learning models with satisfactory performance for real-world applications. Meanwhile, large language models (LLMs) have shown remarkable zero-shot performance on various natural language processing (NLP) tasks, such as emotion classification, question answering, and topic classification. To bridge this performance gap in multimedia understanding, we propose verbalizing story videos to generate their descriptions in natural language and then performing video-understanding tasks on the generated story rather than on the original video. Through extensive experiments on five video-understanding tasks, we demonstrate that our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding. Further, to alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science, persuasion strategy identification.

