xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Abstract

We present xGen-MM-Vid (BLIP-3-Video), a multimodal language model for videos designed to efficiently capture temporal information over multiple frames. In addition to the conventional visual tokenizer, BLIP-3-Video uses a 'temporal encoder' that maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models such as Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient because it uses fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
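
As a rough illustration of what mapping many per-frame tokens into a compact set can look like, the sketch below pools a flattened sequence of frame tokens into 32 outputs using plain attention pooling with a fixed number of query vectors. This is not the BLIP-3-Video implementation: the function name `attention_pool`, the frame count, the per-frame token count, the embedding size, and the random stand-in queries are all assumptions chosen to keep the example small.

```python
import math
import random

def attention_pool(tokens, queries):
    """Pool a list of token vectors into len(queries) output vectors.

    Each output is a softmax-weighted average of all input tokens, with
    weights from scaled dot products between one query and every token.
    """
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
                  for t in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        pooled.append([
            sum(w * t[d] for w, t in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return pooled

# Illustrative sizes only (assumptions, not taken from the paper).
NUM_FRAMES, TOKENS_PER_FRAME, DIM, NUM_OUTPUT = 8, 16, 4, 32

rng = random.Random(0)
# Flatten the per-frame tokens into one sequence spanning all frames.
video_tokens = [[rng.gauss(0, 1) for _ in range(DIM)]
                for _ in range(NUM_FRAMES * TOKENS_PER_FRAME)]
# Stand-in for learned query vectors; in a real model these would be trained.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_OUTPUT)]

compact = attention_pool(video_tokens, queries)
print(len(video_tokens), "->", len(compact))  # 128 -> 32
```

The output length is set by the number of queries alone, so the number of tokens passed on to the language model stays fixed no matter how many frames are sampled.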
