VidLA: Video-Language Alignment at Scale
March 21, 2024
Authors: Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi
cs.AI
Abstract
In this paper, we propose VidLA, an approach for video-language alignment at
scale. There are two major limitations of previous video-language alignment
approaches. First, they do not capture both short-range and long-range temporal
dependencies and typically employ complex hierarchical deep network
architectures that are hard to integrate with existing pretrained image-text
foundation models. To effectively address this limitation, we instead keep the
network architecture simple and use a set of data tokens that operate at
different temporal resolutions in a hierarchical manner, accounting for the
temporally hierarchical nature of videos. By employing a simple two-tower
architecture, we are able to initialize our video-language model with
pretrained image-text foundation models, thereby boosting the final
performance. Second, existing video-language alignment works struggle due to
the lack of semantically aligned large-scale training data. To overcome this, we
leverage recent LLMs to curate the largest video-language dataset to date with
better visual grounding. Furthermore, unlike existing video-text datasets which
only contain short clips, our dataset is enriched with video clips of varying
durations to aid our temporally hierarchical data tokens in extracting better
representations at varying temporal scales. Overall, empirical results show
that our proposed approach surpasses state-of-the-art methods on multiple
retrieval benchmarks, especially on longer videos, and performs competitively
on classification benchmarks.
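
For a concrete picture of the architecture described above, the following is a minimal PyTorch sketch: a two-tower model whose video tower augments per-frame tokens with summary tokens pooled at coarser temporal resolutions, trained with a symmetric contrastive loss. All names and hyperparameters here (HierTokens, TwoTower, the pooling scales, the encoder interfaces) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierTokens(nn.Module):
    """Append tokens summarizing the clip at coarser temporal scales
    (an assumed stand-in for the paper's hierarchical data tokens)."""
    def __init__(self, dim, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.projs = nn.ModuleList([nn.Linear(dim, dim) for _ in scales])

    def forward(self, x):                        # x: (B, T, D) frame tokens
        toks = [x]
        for s, proj in zip(self.scales, self.projs):
            # average-pool non-overlapping windows of s frames (assumes T >= s)
            pooled = x.unfold(1, s, s).mean(-1)  # (B, T // s, D)
            toks.append(proj(pooled))
        return torch.cat(toks, dim=1)            # fine + coarse tokens

class TwoTower(nn.Module):
    """Two-tower video-text model; both towers can be warm-started from a
    pretrained image-text model such as CLIP (encoders are passed in)."""
    def __init__(self, frame_encoder, text_encoder, dim=512):
        super().__init__()
        self.frame_encoder = frame_encoder       # assumed: (N, C, H, W) -> (N, dim)
        self.text_encoder = text_encoder         # assumed: token ids -> (B, dim)
        self.hier = HierTokens(dim)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def encode_video(self, frames):              # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        f = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        tokens = self.temporal(self.hier(f))     # attend across all scales
        return F.normalize(tokens.mean(dim=1), dim=-1)

    def forward(self, frames, text_ids):
        v = self.encode_video(frames)
        t = F.normalize(self.text_encoder(text_ids), dim=-1)
        logits = self.logit_scale.exp() * v @ t.T
        labels = torch.arange(len(v), device=v.device)
        # symmetric InfoNCE over in-batch video-text pairs
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2
```

The design choice the abstract emphasizes is visible here: because the video tower is a plain token sequence (frame tokens plus pooled tokens) rather than a bespoke hierarchical backbone, its encoders can be initialized directly from an image-text checkpoint.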
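The data-curation side can be sketched the same way: cut each source video into clips at several durations so the hierarchical tokens see multiple temporal scales, and use an LLM to turn noisy subtitle text into a visually grounded caption. `call_llm` below is a placeholder for whatever LLM interface is used; the clip lengths and prompt are assumptions, not the paper's recipe.

```python
def make_clips(duration_s, clip_lengths=(10, 30, 60)):
    """Yield (start, end) spans covering one video at several clip lengths,
    so the dataset contains clips of varying durations."""
    for clip_len in clip_lengths:
        for start in range(0, int(duration_s) - clip_len + 1, clip_len):
            yield (start, start + clip_len)

def caption_clip(subtitle_text, call_llm):
    """Ask an LLM to rewrite noisy ASR/subtitle text into a caption that
    describes only what is visible, improving visual grounding."""
    prompt = ("Rewrite the following subtitles as one caption describing "
              "only what is visible on screen:\n" + subtitle_text)
    return call_llm(prompt)
```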