Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
July 15, 2024
Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
cs.AI
Abstract
Traditional reference segmentation tasks have predominantly focused on silent
visual scenes, neglecting the integral role of multimodal perception and
interaction in human experiences. In this work, we introduce a novel task
called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment
objects within the visual domain based on expressions containing multimodal
cues. Such expressions are articulated in natural language forms but are
enriched with multimodal cues, including audio and visual descriptions. To
facilitate this research, we construct the first Ref-AVS benchmark, which
provides pixel-level annotations for objects described in corresponding
multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method
that adequately utilizes multimodal cues to offer precise segmentation
guidance. Finally, we conduct quantitative and qualitative experiments on three
test subsets to compare our approach with existing methods from related tasks.
The results demonstrate the effectiveness of our method, highlighting its
capability to precisely segment objects using multimodal-cue expressions.
Dataset is available at https://gewu-lab.github.io/Ref-AVS.
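The abstract does not detail the model architecture, so the following is a minimal, hypothetical PyTorch sketch of the task interface: given video frames, audio features, and a tokenized multimodal-cue expression, predict a pixel-level mask for the referred object. The encoder choices, feature dimensions, and fusion scheme below are illustrative assumptions, not the authors' actual method.

    # Hypothetical Ref-AVS-style interface (illustrative only, not the paper's method):
    # fuse visual, audio, and text-expression features to predict a per-pixel mask.
    import torch
    import torch.nn as nn

    class RefAVSSketch(nn.Module):
        def __init__(self, dim=256, vocab=30522):
            super().__init__()
            # Toy encoders; a real system would use pretrained visual/audio/text backbones.
            self.visual = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify a frame
            self.audio = nn.Linear(128, dim)                            # e.g., mel-spectrogram features
            self.text = nn.Embedding(vocab, dim)                        # expression token embeddings
            self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

        def forward(self, frames, audio_feats, token_ids):
            # frames: (B, 3, H, W); audio_feats: (B, Ta, 128); token_ids: (B, Tt)
            v = self.visual(frames)                    # (B, dim, H/16, W/16)
            B, D, h, w = v.shape
            v_seq = v.flatten(2).transpose(1, 2)       # (B, h*w, dim)
            cues = torch.cat([self.audio(audio_feats), self.text(token_ids)], dim=1)
            # Visual tokens attend to the audio + text cues, then a 1x1 conv predicts mask logits.
            fused, _ = self.fuse(v_seq, cues, cues)    # (B, h*w, dim)
            fused = fused.transpose(1, 2).reshape(B, D, h, w)
            logits = self.mask_head(fused)             # (B, 1, H/16, W/16)
            return nn.functional.interpolate(logits, size=frames.shape[-2:], mode="bilinear")

    # Example call with random inputs:
    # model = RefAVSSketch()
    # mask_logits = model(torch.rand(1, 3, 224, 224), torch.rand(1, 50, 128),
    #                     torch.randint(0, 30522, (1, 12)))

A real system would replace these toy encoders with pretrained backbones and train the mask head on the Ref-AVS benchmark's pixel-level annotations.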