
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

October 21, 2024
Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
cs.AI

Abstract

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, designed to capture temporal information over multiple frames efficiently. BLIP-3-Video pairs the conventional visual tokenizer with a 'temporal encoder', which maps the sequence of tokens from multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use far fewer visual tokens than competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models such as Token Turing Machines. Our experiments confirm that BLIP-3-Video achieves video question-answering accuracy comparable to much larger state-of-the-art models (e.g., 34B parameters) while being far smaller (4B) and more efficient through its reduced visual token count. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html
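To make the compression idea concrete, below is a minimal PyTorch sketch of one possible temporal encoder of the learnable spatio-temporal pooling variety described in the abstract: a fixed set of learned query tokens cross-attends to all frame tokens and emits a 32-token visual summary. The class name, dimensions, and hyperparameters (TemporalAttentionPool, dim=1024, num_heads=8) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Sketch of learnable spatio-temporal pooling: a small set of
    learned query tokens cross-attends to every frame token and emits
    a fixed budget of visual tokens. Sizes and names are assumptions,
    not the paper's code."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # 32 learned queries -> the model's entire visual token budget.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T * N, D) -- N tokens from each of T frames,
        # concatenated along the sequence axis after the visual tokenizer.
        batch = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (B, 32, D)
        pooled, _ = self.attn(q, frame_tokens, frame_tokens)  # cross-attention
        return self.norm(pooled)                              # (B, 32, D)

# Example: 8 frames x 576 tokens/frame (4608 tokens) -> 32 tokens.
frames = torch.randn(2, 8 * 576, 1024)
print(TemporalAttentionPool()(frames).shape)  # torch.Size([2, 32, 1024])
```

A Token Turing Machine variant, also explored in the paper, would instead process frames sequentially while reading from and writing to a small token memory; the pooling variant above is simply the easier of the two to sketch.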
