VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
November 7, 2024
Authors: Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Fine-grained alignment between videos and text is challenging due to complex
spatial and temporal dynamics in videos. Existing video-based Large Multimodal
Models (LMMs) handle basic conversations but struggle with precise pixel-level
grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed
for fine-grained pixel-level grounding in videos based on user-provided textual
inputs. Our design seamlessly connects three key components: a Large Language
Model, a dual vision encoder that emphasizes both spatial and temporal details,
and a spatio-temporal decoder for accurate mask generation. This connection is
facilitated via tunable V-L and L-V adapters that enable close Vision-Language
(VL) alignment. The architecture is trained to synchronize both spatial and
temporal elements of video content with textual instructions. To enable
fine-grained grounding, we curate a multimodal dataset featuring detailed
visually-grounded conversations using a semi-automatic annotation pipeline,
resulting in a diverse set of 38k video-QA triplets along with 83k objects and
671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded
Conversation Generation, Visual Grounding, and Referring Video Segmentation.
Experimental results show that our model consistently outperforms existing
approaches across all three tasks.
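
The abstract describes the model as a pipeline of a dual (spatial + temporal) vision encoder, tunable V-L and L-V adapters, a large language model, and a spatio-temporal mask decoder. The sketch below shows one way these pieces could be wired together in PyTorch; the module choices, class names (e.g. `DualVisionEncoder`, `GroundedVideoLMM`), and the deliberately small dimensions are illustrative assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch: dual vision encoder -> V-L adapter -> LLM
# -> L-V adapter -> mask decoder. All backbones are replaced by tiny stand-ins.
import torch
import torch.nn as nn


class DualVisionEncoder(nn.Module):
    """Produces per-frame (spatial) and sequence-level (temporal) features."""

    def __init__(self, dim: int = 256, frame_size: int = 64):
        super().__init__()
        self.spatial = nn.Linear(3 * frame_size * frame_size, dim)  # stand-in for an image backbone
        self.temporal = nn.GRU(dim, dim, batch_first=True)          # stand-in for a video backbone

    def forward(self, frames: torch.Tensor):          # frames: (B, T, 3, H, W)
        spatial = self.spatial(frames.flatten(2))     # (B, T, dim)
        temporal, _ = self.temporal(spatial)          # (B, T, dim)
        return spatial, temporal


class GroundedVideoLMM(nn.Module):
    """Encoder -> V-L adapter -> LLM -> L-V adapter -> spatio-temporal mask decoder."""

    def __init__(self, vis_dim: int = 256, llm_dim: int = 512, frame_size: int = 64):
        super().__init__()
        self.frame_size = frame_size
        self.encoder = DualVisionEncoder(vis_dim, frame_size)
        self.vl_adapter = nn.Linear(vis_dim, llm_dim)       # project visual tokens into LLM space
        self.llm = nn.TransformerEncoder(                   # stand-in for a pretrained LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.lv_adapter = nn.Linear(llm_dim, vis_dim)       # project LLM output back for the decoder
        self.mask_decoder = nn.Linear(vis_dim, frame_size * frame_size)  # stand-in mask head

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        spatial, temporal = self.encoder(frames)
        vis_tokens = self.vl_adapter(torch.cat([spatial, temporal], dim=1))
        hidden = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        seg_query = self.lv_adapter(hidden[:, -1:])   # e.g. the embedding of a dedicated segmentation token
        masks = self.mask_decoder(seg_query)          # (B, 1, H*W)
        return masks.view(-1, 1, self.frame_size, self.frame_size)


if __name__ == "__main__":
    model = GroundedVideoLMM()
    frames = torch.randn(1, 8, 3, 64, 64)    # one clip of 8 frames
    text = torch.randn(1, 16, 512)           # already-embedded instruction tokens
    print(model(frames, text).shape)         # torch.Size([1, 1, 64, 64])
```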
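
The curated dataset is summarized only by its counts (38k video-QA triplets, 83k objects, 671k masks). The record below is a hypothetical layout for one such grounded-conversation sample; the field names, the `<p>...</p>`/`[SEG]` tagging convention, and the run-length mask encoding are assumptions for illustration, not the released annotation format.

```python
# Hypothetical layout of one video-QA triplet whose answer phrases are linked
# to object tracks with per-frame segmentation masks (assumed schema).
sample = {
    "video": "videos/clip_0001.mp4",
    "question": "What is the person on the left doing?",
    "answer": "The <p>person</p> [SEG] is pouring water into a <p>glass</p> [SEG].",
    "objects": [
        {   # each grounded phrase links to one object track
            "phrase": "person",
            "masks": {  # frame index -> run-length-encoded binary mask (assumed encoding)
                "0": "RLE-placeholder", "8": "RLE-placeholder", "16": "RLE-placeholder",
            },
        },
        {
            "phrase": "glass",
            "masks": {"0": "RLE-placeholder", "8": "RLE-placeholder", "16": "RLE-placeholder"},
        },
    ],
}
```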
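
The abstract does not name the evaluation metrics for the three tasks. For referring video segmentation, a common choice is mask intersection-over-union averaged over frames; the helper below is a minimal NumPy sketch of that assumed metric, not the paper's official evaluation code.

```python
# Minimal sketch of per-frame IoU and its mean over a video (assumed metric).
import numpy as np


def frame_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two binary masks of shape (H, W)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # both masks empty: treat as a perfect match
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def video_miou(preds, gts) -> float:
    """Mean IoU over the frames of one video (lists of (H, W) masks)."""
    return float(np.mean([frame_iou(p, g) for p, g in zip(preds, gts)]))
```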