Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a pipeline with three core components: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Despite achieving remarkable performance, this line of research has two limitations: 1) linguistic content is not fully injected into the entire visual backbone to support more effective visual feature extraction, so an extra cross-modal interaction module is required; 2) the relationship between the REC and RES tasks is not effectively exploited for collaborative prediction and more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this way, PLVL needs no additional cross-modal fusion module while still fully introducing language guidance. Furthermore, our analysis shows that the localization center for REC helps, to some extent, identify the object region to be segmented for RES. Inspired by this analysis, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets show that PLVL clearly outperforms representative methods on both REC and RES tasks. https://github.com/jcwang0602/PLVL
