Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a pipeline with three core components: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Although these approaches achieve remarkable performance, this line of research has two limitations: 1) the linguistic content is not fully injected into the entire visual backbone to make visual feature extraction more effective, and an extra cross-modal interaction module is still needed; 2) the relationship between the REC and RES tasks is not effectively exploited to support collaborative prediction and produce more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL needs no additional cross-modal fusion module while still fully introducing language guidance. Furthermore, our analysis indicates that the localization center for REC helps, to some extent, identify the object region to be segmented for RES. Inspired by this observation, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets show that PLVL clearly outperforms representative methods on both REC and RES. https://github.com/jcwang0602/PLVL
