Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Abstract

Multi-task visual grounding (MTVG) comprises two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Representative existing approaches generally follow a pipeline with three core components: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Although this line of work achieves remarkable performance, it has two limitations: 1) linguistic content is not fully injected into the entire visual backbone to promote more effective visual feature extraction, so an extra cross-modal interaction module is required; 2) the relationship between the REC and RES tasks is not effectively exploited for collaborative prediction that would yield more accurate outputs. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL needs no additional cross-modal fusion module while still fully introducing language guidance. Furthermore, we observe that the localization center for REC helps, to some extent, identify the object region to be segmented for RES. Based on this observation, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets show that PLVL clearly outperforms representative methods on both REC and RES tasks. https://github.com/jcwang0602/PLVL
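
To make the REC-to-RES intuition concrete, the following is a minimal sketch of one way a predicted box center could bias per-pixel segmentation scores toward the referred object. It is an illustration under stated assumptions, not the multi-task head used in PLVL: the Gaussian prior, the function names (`box_center`, `center_prior`, `apply_center_prior`), and the `sigma` and `weight` parameters are all hypothetical and do not come from the abstract.

```python
import math


def box_center(box_xyxy):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box_xyxy
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def center_prior(height, width, center_xy, sigma):
    """Gaussian map that peaks at the given center.

    Assumption: a Gaussian is used only because it is a simple,
    smooth way to express "pixels near the REC center are more likely
    to belong to the target"; the abstract does not specify a form.
    """
    cx, cy = center_xy
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def apply_center_prior(seg_logits, prior, weight=1.0):
    """Add a log-prior bias to each per-pixel segmentation logit."""
    eps = 1e-6  # avoids log(0) far from the center
    return [
        [logit + weight * math.log(p + eps) for logit, p in zip(row_l, row_p)]
        for row_l, row_p in zip(seg_logits, prior)
    ]


# Toy example: uniform logits on an 8x8 grid and a hypothetical REC box.
logits = [[0.0] * 8 for _ in range(8)]
box = (2, 2, 6, 6)
prior = center_prior(8, 8, box_center(box), sigma=2.0)
biased = apply_center_prior(logits, prior)
```

In this toy setup the biased scores are highest at the box center and fall off with distance, so a downstream threshold would favor pixels near the localized object. A learned head would replace the fixed Gaussian with features shared between the two tasks.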

