
PiTe: Pixel-Temporal Alignment for Large Video-Language Model

September 11, 2024
Authors: Yang Liu, Pengxiang Ding, Siteng Huang, Min Zhang, Han Zhao, Donglin Wang
cs.AI

Abstract

Fueled by the wave of Large Language Models (LLMs), Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs due to the complex relationship between language and spatial-temporal data structures. Recent Large Video-Language Models (LVidLMs) align features of static visual data, such as images, into the latent space of language features, leveraging the abilities of LLMs through general multi-modal tasks. In this paper, we explore a fine-grained alignment approach that uses object trajectories to align the modalities across the spatial and temporal dimensions simultaneously. We thus propose PiTe, a novel LVidLM built on trajectory-guided pixel-temporal alignment, which exhibits promising applicable model properties. To achieve fine-grained video-language alignment, we curate PiTe-143k, a multi-modal pre-training dataset that provides pixel-level moving trajectories, generated by our automatic annotation pipeline, for every individual object that both appears in the video and is mentioned in the caption. PiTe demonstrates astounding capabilities on a myriad of video-related multi-modal tasks, beating the state-of-the-art methods by a large margin.
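The core idea described above, predicting a pixel-level trajectory for each object that the caption mentions, can serve as an auxiliary supervision signal tying video features to language features frame by frame. Below is a minimal sketch of such a trajectory-regression head; it is not the authors' implementation, and the module name `TrajectoryHead`, the feature dimensions, the visibility mask, and the L1 objective are all illustrative assumptions.

```python
# Hypothetical sketch of trajectory-guided alignment as an auxiliary task.
# Shapes, names, and the loss are assumptions, not PiTe's actual code.
import torch
import torch.nn as nn

class TrajectoryHead(nn.Module):
    """Predicts one normalized (x, y) point per frame for a referenced object."""
    def __init__(self, video_dim=1024, text_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Linear(video_dim + text_dim, hidden)
        self.head = nn.Sequential(nn.GELU(), nn.Linear(hidden, 2))

    def forward(self, video_feats, obj_emb):
        # video_feats: (T, video_dim) per-frame features from the visual encoder
        # obj_emb:     (text_dim,)    embedding of the object phrase in the caption
        T = video_feats.size(0)
        fused = torch.cat([video_feats, obj_emb.expand(T, -1)], dim=-1)
        return self.head(self.fuse(fused))  # (T, 2) predicted trajectory

def trajectory_loss(pred, target, visible):
    # L1 regression, restricted to frames where the object is visible.
    return (pred - target).abs()[visible].mean()

if __name__ == "__main__":
    head = TrajectoryHead()
    feats = torch.randn(16, 1024)   # 16 frames of video features
    obj = torch.randn(768)          # e.g. embedding of "the dog" from the caption
    gt = torch.rand(16, 2)          # ground-truth normalized (x, y) per frame
    vis = torch.rand(16) > 0.2      # per-frame visibility mask
    loss = trajectory_loss(head(feats, obj), gt, vis)
    loss.backward()
```

Conditioning the per-frame prediction on the object's text embedding is what makes this an alignment signal: the head can only place the object correctly if the visual and language features agree on which object is meant.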
