Multimodal Referring Segmentation: A Survey

Abstract

Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions given as text or audio. The task plays a crucial role in practical applications that require accurate object perception from user instructions. Over the past decade it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing the field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across the three primary types of visual scene: images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods, which address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.
