VidLA: 대규모 비디오-언어 정렬

초록

본 논문에서는 대규모 비디오-언어 정렬을 위한 VidLA 접근법을 제안한다. 기존 비디오-언어 정렬 접근법에는 두 가지 주요 한계점이 있다. 첫째, 단기 및 장기 시간적 의존성을 모두 포착하지 못하며, 일반적으로 기존의 사전 학습된 이미지-텍스트 기반 모델과 통합하기 어려운 복잡한 계층적 심층 네트워크 아키텍처를 사용한다. 이러한 한계를 효과적으로 해결하기 위해, 우리는 네트워크 아키텍처를 단순하게 유지하고 비디오의 시간적 계층적 특성을 고려하여 다양한 시간적 해상도에서 작동하는 데이터 토큰 세트를 계층적으로 사용한다. 간단한 두 타워(two-tower) 아키텍처를 사용함으로써, 사전 학습된 이미지-텍스트 기반 모델로 비디오-언어 모델을 초기화하여 최종 성능을 향상시킬 수 있다. 둘째, 기존의 비디오-언어 정렬 연구는 의미적으로 정렬된 대규모 학습 데이터의 부족으로 어려움을 겪는다. 이를 극복하기 위해, 우리는 최신 대형 언어 모델(LLM)을 활용하여 지금까지 가장 큰 비디오-언어 데이터셋을 구축하고 더 나은 시각적 근거를 제공한다. 또한, 기존의 비디오-텍스트 데이터셋이 짧은 클립만 포함하는 것과 달리, 우리의 데이터셋은 다양한 지속 시간의 비디오 클립으로 풍부하게 구성되어 시간적 계층적 데이터 토큰이 다양한 시간적 규모에서 더 나은 표현을 추출할 수 있도록 돕는다. 전반적으로, 실험 결과는 우리가 제안한 접근법이 여러 검색 벤치마크에서 최신 방법을 능가하며, 특히 긴 비디오에서 더 나은 성능을 보이고, 분류 벤치마크에서도 경쟁력 있는 성능을 보임을 입증한다.

English

In this paper, we propose VidLA, an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First, they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. To effectively address this limitation, we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner, accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture, we are able to initialize our video-language model with pretrained image-text foundation models, thereby boosting the final performance. Second, existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it, we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore, unlike existing video-text datasets which only contain short clips, our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall, empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks, especially on longer videos, and performs competitively on classification benchmarks.

VidLA: 대규모 비디오-언어 정렬

VidLA: Video-Language Alignment at Scale

초록

Summary

Support