PiTe: Pixel-Temporal Alignment for Large Video-Language Model
September 11, 2024
Authors: Yang Liu, Pengxiang Ding, Siteng Huang, Min Zhang, Han Zhao, Donglin Wang
cs.AI
Abstract
Fueled by the Large Language Models (LLMs) wave, Large Visual-Language Models
(LVLMs) have emerged as a pivotal advancement, bridging the gap between image
and text. However, video makes it challenging for LVLMs to perform adequately
due to the complex relationship between language and spatial-temporal data
structures. Recent Large Video-Language Models (LVidLMs) align features of
static visual data such as images with the latent space of language features
through general multi-modal tasks, in order to sufficiently leverage the
abilities of LLMs. In this paper, we explore a fine-grained alignment approach
via object trajectories that aligns the different modalities across both
spatial and temporal dimensions simultaneously. We thus propose a novel LVidLM
with trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, which exhibits
promising applicable model properties. To achieve fine-grained video-language
alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which
provides pixel-level moving trajectories for all individual objects that
appear in both the video and the caption, produced by our automatic annotation
pipeline. Meanwhile, PiTe demonstrates astounding capabilities on a myriad of
video-related multi-modal tasks, beating the state-of-the-art methods by a
large margin.