EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone

Abstract

Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks use separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, which limits the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement over the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representations during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion-in-the-backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2, which achieves consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/EgoVLPv2/.
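To make the idea of fusion inside the backbone more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that fusion takes the form of a residual cross-attention term scaled by a gate, so that a gate of zero leaves a backbone layer purely unimodal. The gate, the toy token values, the omission of learned projections, and the names `cross_attention` and `fused_layer` are all illustrative assumptions that do not come from the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over the tokens of the other modality.

    Learned query/key/value projections are omitted for brevity
    (an assumption of this sketch).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return attended


def fused_layer(tokens, other_tokens, gate):
    """Add gated cross-modal attention to a layer's tokens (hypothetical gate)."""
    if gate == 0.0:
        # Fusion switched off: the layer behaves as a unimodal encoder layer.
        return [list(t) for t in tokens]
    attended = cross_attention(tokens, other_tokens)
    return [
        [t_i + gate * a_i for t_i, a_i in zip(t, a)]
        for t, a in zip(tokens, attended)
    ]


# Toy inputs: 2 video tokens and 3 text tokens, each of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

unimodal = fused_layer(video, text, gate=0.0)  # no cross-modal information
fused = fused_layer(video, text, gate=1.0)     # video tokens enriched with text
```

Under these assumptions, the same layer can serve both as a separate encoder and as a fused encoder, which is one way to read the abstract's claim that the cross-modal attention modules are reused across downstream tasks without stacking extra fusion-specific layers.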
