VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Abstract

Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model that combines a language model's complex reasoning capabilities with the efficient similarity search of contrastive retrieval. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy that leverages generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, and offering a novel perspective on video understanding. Code is released at https://github.com/showlab/VLog.
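
To make the idea of retrieving from a hierarchical narration vocabulary more concrete, the following is a minimal sketch and not the released VLog implementation. It assumes a query vector (in VLog this would presumably come from the language model) is first matched against scenario embeddings and then against the event narrations under the chosen scenario, using cosine similarity. The names `SCENARIOS`, `EVENTS`, `best_match`, and `retrieve`, as well as all embedding values, are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query, candidates):
    """Return the key in `candidates` whose vector is most similar to `query`."""
    return max(candidates, key=lambda key: cosine(query, candidates[key]))


# Hypothetical toy embeddings; a real system would learn these from data.
SCENARIOS = {
    "kitchen": [1.0, 0.0, 0.0],
    "bedroom": [0.0, 1.0, 0.0],
}
EVENTS = {
    "kitchen": {
        "cutting a tomato": [0.9, 0.0, 0.4],
        "washing dishes": [0.9, 0.0, -0.4],
    },
    "bedroom": {
        "turning off an alarm": [0.0, 0.9, 0.4],
        "making the bed": [0.0, 0.9, -0.4],
    },
}


def retrieve(query):
    """Two-stage lookup: pick a scenario, then an event within that scenario."""
    scenario = best_match(query, SCENARIOS)
    event = best_match(query, EVENTS[scenario])
    return scenario, event


if __name__ == "__main__":
    print(retrieve([0.8, 0.1, 0.5]))
    print(retrieve([0.1, 0.9, 0.3]))
```

The two-stage lookup only compares the query against the events of one scenario rather than the whole vocabulary, which is one plausible reading of how a hierarchy could make indexing efficient; the actual narration pair encoding algorithm is not specified in this abstract.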
