VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
March 12, 2025
Authors: Kevin Qinghong Lin, Mike Zheng Shou
cs.AI
Abstract
Human daily activities can be concisely narrated as sequences of routine
events (e.g., turning off an alarm) in video streams, forming an event
vocabulary. Motivated by this, we introduce VLog, a novel video understanding
framework that defines video narrations as vocabulary, going beyond the typical
subword vocabularies in existing generative video-language models. Built on the
lightweight language model GPT-2, VLog features three key innovations: (i) A
generative retrieval model, marrying a language model's complex reasoning
capabilities with contrastive retrieval's efficient similarity search. (ii) A
hierarchical vocabulary derived from large-scale video narrations using our
narration pair encoding algorithm, enabling efficient indexing of specific
events (e.g., cutting a tomato) by identifying broader scenarios (e.g.,
kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary
update strategy leveraging generative models to extend the vocabulary for novel
events encountered during inference. To validate our approach, we introduce
VidCap-Eval, a development set requiring concise narrations with reasoning
relationships (e.g., before and after). Experiments on EgoSchema, COIN, and
HiREST further demonstrate the effectiveness of VLog, highlighting its ability
to generate concise, contextually accurate, and efficient narrations, offering
a novel perspective on video understanding. Code is released at
https://github.com/showlab/VLog.
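
As a rough illustration of the hierarchical indexing idea in (ii), the sketch below first picks the closest scenario and then the closest narration inside that scenario, using cosine similarity. It is a toy under stated assumptions, not the released VLog code: the `VOCAB` table, the three-dimensional embeddings, and the `retrieve` helper are all hypothetical, and in the actual framework the query would come from the language model rather than a hand-written vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy vocabulary (not from the paper):
# scenario -> (scenario prototype embedding, {narration: embedding}).
VOCAB = {
    "kitchen": ([1.0, 0.0, 0.0], {
        "cutting a tomato": [0.9, 0.1, 0.0],
        "washing a dish": [0.8, 0.0, 0.3],
    }),
    "bedroom": ([0.0, 1.0, 0.0], {
        "turning off an alarm": [0.1, 0.9, 0.0],
        "making the bed": [0.0, 0.8, 0.3],
    }),
}


def retrieve(query):
    """Two-stage lookup: closest scenario first, then closest narration in it."""
    scenario = max(VOCAB, key=lambda s: cosine(query, VOCAB[s][0]))
    narrations = VOCAB[scenario][1]
    narration = max(narrations, key=lambda n: cosine(query, narrations[n]))
    return scenario, narration


if __name__ == "__main__":
    # Assumed query embedding standing in for a video clip representation.
    print(retrieve([0.85, 0.15, 0.05]))
```

Searching scenarios before narrations means each query is compared against one scenario's entries rather than the whole vocabulary, which is the efficiency argument the abstract makes for a hierarchical vocabulary.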