Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Representative existing approaches generally follow a pipeline with three core procedures: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Although this line of research achieves remarkable performance, it has two limitations: 1) the linguistic content is not fully injected into the entire visual backbone to support more effective visual feature extraction, so an extra cross-modal interaction module is required; 2) the relationship between the REC and RES tasks is not effectively exploited for collaborative prediction and more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL does not need an additional cross-modal fusion module while still fully introducing language guidance. Furthermore, our analysis indicates that the localization center for REC helps, to some extent, to identify the object region to be segmented for RES. Inspired by this observation, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets substantiate that PLVL clearly outperforms representative methods on both the REC and RES tasks. Project repository: https://github.com/jcwang0602/PLVL
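The observation that the REC localization center can help identify the RES target region can be illustrated with a small sketch. The abstract does not specify how the multi-task head combines the two predictions, so everything below is an assumption made for illustration only: the function names `center_prior` and `center_guided_mask`, the Gaussian form of the prior, and the `sigma` and `threshold` parameters are hypothetical and are not taken from PLVL. The sketch reweights per-pixel mask probabilities with a prior peaked at the center of the predicted box, so pixels far from the localized object are suppressed.

```python
import math


def center_prior(height, width, cx, cy, sigma=0.25):
    """Gaussian prior over an H x W grid, peaked at the normalized point (cx, cy).

    Hypothetical helper for illustration; not part of the PLVL code base.
    """
    prior = []
    for row in range(height):
        y = (row + 0.5) / height  # normalized coordinate of the pixel center
        prior_row = []
        for col in range(width):
            x = (col + 0.5) / width
            dist_sq = (x - cx) ** 2 + (y - cy) ** 2
            prior_row.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
        prior.append(prior_row)
    return prior


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def center_guided_mask(mask_logits, box, sigma=0.25, threshold=0.5):
    """Reweight mask probabilities by a prior centered on the REC box.

    mask_logits: H x W nested list of per-pixel logits (assumed RES output).
    box: (x0, y0, x1, y1) in normalized [0, 1] coordinates (assumed REC output).
    Returns an H x W nested list of booleans.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0  # REC localization center
    height, width = len(mask_logits), len(mask_logits[0])
    prior = center_prior(height, width, cx, cy, sigma)
    return [
        [sigmoid(mask_logits[r][c]) * prior[r][c] > threshold for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    # Uniformly confident logits: only pixels near the box center survive.
    logits = [[2.0] * 4 for _ in range(4)]
    mask = center_guided_mask(logits, box=(0.0, 0.0, 0.5, 0.5))
    for row in mask:
        print("".join("#" if keep else "." for keep in row))
```

In a learned model the coupling would more plausibly be trained end to end rather than applied as a fixed post-hoc prior; the fixed Gaussian is used here only because it keeps the example self-contained.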
