PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Abstract

Fueled by the wave of Large Language Models (LLMs), Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs because of the complex relationship between language and spatial-temporal data structure. Recent Large Video-Language Models (LVidLMs) align features of static visual data, such as images, into the latent space of language features through general multi-modal tasks in order to leverage the abilities of LLMs. In this paper, we explore a fine-grained alignment approach via object trajectories that aligns the modalities across both spatial and temporal dimensions simultaneously. We therefore propose a novel LVidLM based on trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, that exhibits promising applicable model properties. To achieve fine-grained video-language alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which provides pixel-level moving trajectories for all individual objects that both appear in the video and are mentioned in the caption, produced by our automatic annotation pipeline. PiTe demonstrates strong capabilities across a wide range of video-related multi-modal tasks, outperforming state-of-the-art methods by a large margin.
