VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
November 7, 2024
作者: Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Fine-grained alignment between videos and text is challenging due to complex
spatial and temporal dynamics in videos. Existing video-based Large Multimodal
Models (LMMs) handle basic conversations but struggle with precise pixel-level
grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed
for fine-grained pixel-level grounding in videos based on user-provided textual
inputs. Our design seamlessly connects three key components: a Large Language
Model, a dual vision encoder that emphasizes both spatial and temporal details,
and a spatio-temporal decoder for accurate mask generation. This connection is
facilitated via tunable V-L and L-V adapters that enable close Vision-Language
(VL) alignment. The architecture is trained to synchronize both spatial and
temporal elements of video content with textual instructions. To enable
fine-grained grounding, we curate a multimodal dataset featuring detailed
visually grounded conversations using a semi-automatic annotation pipeline,
resulting in a diverse set of 38k video-QA triplets along with 83k objects and
671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded
Conversation Generation, Visual Grounding, and Referring Video Segmentation.
Experimental results show that our model consistently outperforms existing
approaches across all three tasks.
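
The abstract describes a pipeline built from a dual vision encoder (spatial and temporal branches), tunable V-L and L-V adapters, a Large Language Model, and a spatio-temporal decoder that produces segmentation masks. The sketch below is a minimal, hypothetical PyTorch illustration of how such components could be wired together; every module, dimension, and fusion choice here (patch size, GRU temporal branch, transformer stand-in for the LLM, single segmentation token) is an assumption for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual-encoder + adapters + LLM + mask-decoder pipeline
# in the spirit of the VideoGLaMM abstract. All names and sizes are assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Tunable projection between the vision and language token spaces."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):
        return self.proj(x)


class VideoGLaMMSketch(nn.Module):
    """Toy stand-in for the architecture described in the abstract."""
    def __init__(self, vision_dim=768, llm_dim=1024, mask_dim=256, patch=16):
        super().__init__()
        # Dual vision encoder: one branch for per-frame spatial detail,
        # one for temporal dynamics across frames (both placeholders).
        self.spatial_encoder = nn.Linear(3 * patch * patch, vision_dim)
        self.temporal_encoder = nn.GRU(vision_dim, vision_dim, batch_first=True)
        # Tunable V->L and L->V adapters for vision-language alignment.
        self.v2l = Adapter(vision_dim, llm_dim)
        self.l2v = Adapter(llm_dim, mask_dim)
        # Placeholder "LLM" and spatio-temporal mask decoder.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.mask_decoder = nn.Linear(mask_dim, patch * patch)  # per-patch mask logits

    def forward(self, video_patches, text_embeds):
        # video_patches: (B, T, N, 3*patch*patch); text_embeds: (B, L, llm_dim)
        spatial = self.spatial_encoder(video_patches)             # (B, T, N, D_v)
        temporal, _ = self.temporal_encoder(spatial.mean(dim=2))  # (B, T, D_v)
        vis_tokens = torch.cat([spatial.flatten(1, 2), temporal], dim=1)
        llm_in = torch.cat([self.v2l(vis_tokens), text_embeds], dim=1)
        llm_out = self.llm(llm_in)
        # Route LLM output back through the L->V adapter to condition the
        # mask decoder, yielding pixel-level (here, patch-level) logits.
        seg_embed = self.l2v(llm_out[:, -1:, :])                  # (B, 1, D_m)
        return self.mask_decoder(seg_embed)                       # (B, 1, patch*patch)


if __name__ == "__main__":
    model = VideoGLaMMSketch()
    video = torch.randn(1, 8, 196, 3 * 16 * 16)  # 8 frames, 14x14 = 196 patches each
    text = torch.randn(1, 12, 1024)              # 12 text-token embeddings (assumed width)
    print(model(video, text).shape)              # -> torch.Size([1, 1, 256])
```

The key idea this sketch tries to mirror is the two-way coupling stated in the abstract: a V-L adapter injects spatio-temporal visual tokens into the language model, and an L-V adapter carries language-conditioned embeddings back to a decoder that generates masks; how the actual model tokenizes video, selects segmentation embeddings, and upsamples masks is not specified in the abstract.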