
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

July 11, 2023
Authors: Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang
cs.AI

Abstract

Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement over the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representations during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion-in-the-backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2, achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
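The abstract's central idea, inserting cross-modal attention directly into the uni-modal backbones and reusing those modules across downstream tasks, can be illustrated with a minimal sketch. The PyTorch block below is not the authors' architecture: `FusionBlock`, the zero-initialized gate, the dimensions, and the dual-encoder vs. fusion-encoder usage are all illustrative assumptions, meant only to show how a single backbone layer can run with fusion switched off or on.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Transformer block with an optional cross-attention step, so the same
    backbone layer can run uni-modally (fusion off) or cross-modally (fusion on).
    Illustrative sketch only; not the EgoVLPv2 implementation."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Zero-initialized gate keeps the layer equivalent to a plain
        # transformer block at the start of training (an assumed design choice).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x, context=None):
        # Standard self-attention over the layer's own modality.
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        # Cross-attention to the other modality, only when fusion is requested.
        if context is not None:
            x = x + self.gate * self.cross_attn(self.norm2(x), context, context)[0]
        return x + self.mlp(self.norm3(x))


# Usage: the same block serves as a dual encoder (context=None) or as a
# fusion encoder (context=text_tokens), mirroring the "reuse" idea.
video_tokens = torch.randn(2, 196, 768)
text_tokens = torch.randn(2, 32, 768)
block = FusionBlock()
uni_modal = block(video_tokens)                 # fusion off: separate encoders
cross_modal = block(video_tokens, text_tokens)  # fusion in the backbone
```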