Multimodal Referring Segmentation: A Survey
August 1, 2025
Authors: Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Multimodal referring segmentation aims to segment target objects in visual
scenes, such as images, videos, and 3D scenes, based on referring expressions
in text or audio format. This task plays a crucial role in practical
applications requiring accurate object perception based on user instructions.
Over the past decade, it has gained significant attention in the multimodal
community, driven by advances in convolutional neural networks, transformers,
and large language models, all of which have substantially improved multimodal
perception capabilities. This paper provides a comprehensive survey of
multimodal referring segmentation. We begin by introducing the background of
this field, including problem definitions and commonly used datasets. Next, we
summarize a unified meta-architecture for referring segmentation and review
representative methods across the three primary visual scenes: images, videos,
and 3D scenes. We further discuss Generalized Referring Expression
(GREx) methods to address the challenges of real-world complexity, along with
related tasks and practical applications. Extensive performance comparisons on
standard benchmarks are also provided. We continually track related work at
https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.