EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Abstract

Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks use separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, which limits the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement over the previous generation that incorporates cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representations during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion-in-the-backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2, which achieves consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
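
One way to picture fusion in the backbone is a backbone layer that carries an optional cross-attention step, so the same weights can run as a unimodal encoder (fusion off) or as a fusion encoder (fusion on). The sketch below is a hypothetical toy in pure Python, not the EgoVLPv2 implementation: the names `attend` and `backbone_layer`, the scalar `gate`, the `fuse` switch, and the toy inputs are all assumptions made for illustration, and learned projections, normalization, and feed-forward blocks are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention; keys double as values (no learned projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(dim)])
    return out

def backbone_layer(x, other=None, gate=0.5, fuse=False):
    """Toy layer: self-attention with a residual, plus optional gated cross-attention.

    `gate` and `fuse` are hypothetical knobs for this sketch, not names from the paper.
    """
    sa = attend(x, x)
    h = [[a + b for a, b in zip(row, s)] for row, s in zip(x, sa)]
    if fuse and other is not None:
        # Cross-attention lives inside the layer, so no extra fusion stack is needed.
        ca = attend(h, other)
        h = [[a + gate * b for a, b in zip(row, c)] for row, c in zip(h, ca)]
    return h

# Toy token features: 3 "video" tokens and 2 "text" tokens, each of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, -0.5], [2.0, 0.0]]

dual = backbone_layer(video)                    # fusion off: video-only features
fused = backbone_layer(video, text, fuse=True)  # fusion on: text-conditioned features
```

With `fuse=False` the layer ignores the other modality, which suits retrieval-style tasks that encode video and text independently; with `fuse=True` the same layer produces joint features without a separate fusion stack. Setting `gate=0.0` reproduces the unimodal output exactly, which is why a gate of this kind is a convenient way to add cross-attention to an existing backbone.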
