Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a pipeline with three core procedures: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Although these approaches achieve remarkable performance, this line of research has two limitations: 1) the linguistic content is not fully injected into the entire visual backbone to drive more effective visual feature extraction, so an extra cross-modal interaction module is required; 2) the relationship between the REC and RES tasks is not effectively exploited to support collaborative prediction and more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL does not need an additional cross-modal fusion module while still fully introducing language guidance. Furthermore, we observe that the localization center for REC helps, to some extent, to identify the object region to be segmented for RES. Inspired by this observation, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets show that PLVL clearly outperforms representative methods on both the REC and RES tasks.

https://github.com/jcwang0602/PLVL
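
The observation that the REC localization center can help identify the region to segment for RES can be illustrated with a toy sketch. The code below is not the PLVL multi-task head, whose design the abstract does not specify. It assumes a hypothetical scheme in which a predicted box center is turned into a Gaussian prior that reweights per-pixel segmentation scores. All names (`center_prior`, `apply_prior`, `sigma`, `threshold`) and all values are illustrative assumptions.

```python
import math


def center_prior(height, width, center, sigma):
    """Gaussian weight map peaking at the (row, col) center predicted for REC.

    Hypothetical helper for illustration; not taken from the PLVL code base.
    """
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def apply_prior(scores, prior, threshold=0.5):
    """Reweight per-pixel RES scores by the prior, then threshold to a binary mask."""
    return [
        [1 if s * p >= threshold else 0 for s, p in zip(score_row, prior_row)]
        for score_row, prior_row in zip(scores, prior)
    ]


# Toy 5x5 score map with two equally confident pixels:
# one at the assumed REC center (1, 1) and a distractor at (3, 4).
scores = [[0.0] * 5 for _ in range(5)]
scores[1][1] = 0.9
scores[3][4] = 0.9

prior = center_prior(5, 5, center=(1, 1), sigma=1.0)
mask = apply_prior(scores, prior)

for row in mask:
    print(row)
```

In this toy setting the pixel at the assumed center survives thresholding while the equally confident distractor far from the center is suppressed, which is the kind of cross-task cue the abstract describes. A learned head would replace the fixed Gaussian and threshold.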
