VidLA: Video-Language Alignment at Scale

March 21, 2024
Authors: Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi
cs.AI

Abstract

In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome this limitation, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets, which contain only short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.
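
To make the "data tokens at different temporal resolutions" idea concrete, the sketch below is our own minimal illustration, not the authors' released implementation: the function name, the mean-pooling choice, and the power-of-two window scheme are assumptions for exposition. It builds summary tokens where coarse levels pool over long temporal windows (long-range dependencies) and fine levels pool over short windows (short-range dependencies).

```python
import torch


def hierarchical_temporal_tokens(frame_feats: torch.Tensor,
                                 levels: int = 3) -> torch.Tensor:
    """Build multi-resolution summary tokens from per-frame features.

    frame_feats: (batch, num_frames, dim) per-frame embeddings.
    Level k contributes 2**k tokens, each mean-pooled over a window
    of num_frames / 2**k frames, so the output has 2**levels - 1
    tokens spanning coarse (whole clip) to fine temporal scales.
    """
    b, t, d = frame_feats.shape
    assert t % (2 ** (levels - 1)) == 0, "frames must split evenly"
    tokens = []
    for k in range(levels):
        n_windows = 2 ** k  # 1 window at the coarsest level, then 2, 4, ...
        # Split the time axis into equal windows and mean-pool each one.
        windows = frame_feats.reshape(b, n_windows, t // n_windows, d)
        tokens.append(windows.mean(dim=2))  # (b, n_windows, d)
    return torch.cat(tokens, dim=1)  # (b, 2**levels - 1, d)


feats = torch.randn(2, 8, 512)                 # 2 videos, 8 frames each
summary = hierarchical_temporal_tokens(feats)  # shape (2, 7, 512)
print(summary.shape)
```

Summary tokens like these can be processed alongside the regular per-frame tokens in a standard encoder, which is what allows the overall architecture to stay simple enough to be initialized from a pretrained image-text two-tower model, as the abstract describes.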
