Multimodal Referring Segmentation: A Survey

August 1, 2025
作者: Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.
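
To make the task's input/output contract concrete, here is a minimal Python sketch of referring segmentation as described in the abstract: a visual scene plus a referring expression (text or audio) in, a segmentation mask out. The names `ReferringQuery` and `segment_referred_object` and the placeholder predictor are hypothetical illustrations, not an API from the survey; a real model would fuse visual features with a language or audio embedding before decoding a mask.

```python
# Minimal sketch of the referring-segmentation task interface.
# All names here are hypothetical; this is not the survey's code.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ReferringQuery:
    """A referring expression in text or audio form."""
    text: Optional[str] = None          # e.g., "the dog on the left"
    audio: Optional[np.ndarray] = None  # raw waveform, if audio-based


def segment_referred_object(image: np.ndarray, query: ReferringQuery) -> np.ndarray:
    """Return a binary mask of shape (H, W) marking the referred object.

    A real model would encode the image (or video frames / 3D scene) and
    the query, fuse the two modalities, and decode a mask; here we return
    an all-false mask as a stand-in.
    """
    h, w = image.shape[:2]
    return np.zeros((h, w), dtype=bool)


if __name__ == "__main__":
    image = np.random.rand(480, 640, 3)
    query = ReferringQuery(text="the person holding a red umbrella")
    mask = segment_referred_object(image, query)
    print(mask.shape, mask.dtype)  # (480, 640) bool
```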