

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

October 21, 2024
Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
cs.AI

Abstract

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, designed specifically to capture temporal information over multiple frames efficiently. BLIP-3-Video takes advantage of a 'temporal encoder' in addition to the conventional visual tokenizer, mapping the sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video achieves video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B) while being much smaller (i.e., 4B) and more efficient through its use of fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
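
To make the token-compression idea concrete, below is a minimal PyTorch sketch of one of the temporal-encoder variants the abstract mentions: learnable spatio-temporal pooling that cross-attends a small set of learned query tokens into all per-frame visual tokens, producing a fixed budget of 32 video-level tokens. This is an illustrative assumption of how such a module could look, not the released xGen-MM-Vid implementation; the class name, dimensions, and hyperparameters are hypothetical.

```python
# Hypothetical sketch of a learnable attention-pooling "temporal encoder":
# per-frame tokens from a conventional visual tokenizer are compressed into
# a fixed, small set of video-level tokens (e.g., 32). Not the authors' code.
import torch
import torch.nn as nn


class AttentionPoolTemporalEncoder(nn.Module):
    """Compress (frames x tokens_per_frame) visual tokens into num_video_tokens."""

    def __init__(self, dim: int = 768, num_video_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        # Learnable queries: one per output video token.
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, tokens_per_frame, dim)
        b, t, n, d = frame_tokens.shape
        kv = frame_tokens.reshape(b, t * n, d)        # flatten time and space
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, kv, kv)              # cross-attend into all frames
        return self.norm(pooled)                      # (batch, num_video_tokens, dim)


if __name__ == "__main__":
    # Example: 8 frames x 128 tokens/frame = 1024 input tokens -> 32 output tokens.
    x = torch.randn(2, 8, 128, 768)
    encoder = AttentionPoolTemporalEncoder()
    print(encoder(x).shape)  # torch.Size([2, 32, 768])
```

The fixed output size is what drives the 32 vs. 4608 token comparison in the abstract: however many frames are sampled, the language model only ever sees the pooled set of video tokens.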
