Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
July 15, 2024
Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
cs.AI
Abstract
Traditional reference segmentation tasks have predominantly focused on silent
visual scenes, neglecting the integral role of multimodal perception and
interaction in human experiences. In this work, we introduce a novel task
called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment
objects within the visual domain based on expressions containing multimodal
cues. Such expressions are articulated in natural language forms but are
enriched with multimodal cues, including audio and visual descriptions. To
facilitate this research, we construct the first Ref-AVS benchmark, which
provides pixel-level annotations for objects described in corresponding
multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method
that adequately utilizes multimodal cues to offer precise segmentation
guidance. Finally, we conduct quantitative and qualitative experiments on three
test subsets to compare our approach with existing methods from related tasks.
The results demonstrate the effectiveness of our method, highlighting its
capability to precisely segment objects using multimodal-cue expressions.
Dataset is available at https://gewu-lab.github.io/Ref-AVS.
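The abstract does not detail the model architecture, so the following is a minimal, hypothetical PyTorch sketch of the task interface: given video frames, audio features, and a tokenized multimodal-cue expression, predict a pixel-level mask for the referred object. The encoder choices, feature dimensions, and fusion scheme below are illustrative assumptions, not the authors' actual method.

    # Hypothetical Ref-AVS-style interface (illustrative only, not the paper's method):
    # fuse visual, audio, and text-expression features to predict a per-pixel mask.
    import torch
    import torch.nn as nn

    class RefAVSSketch(nn.Module):
        def __init__(self, dim=256, vocab=30522):
            super().__init__()
            # Toy encoders; a real system would use pretrained visual/audio/text backbones.
            self.visual = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify a frame
            self.audio = nn.Linear(128, dim)                            # e.g., mel-spectrogram features
            self.text = nn.Embedding(vocab, dim)                        # expression token embeddings
            self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

        def forward(self, frames, audio_feats, token_ids):
            # frames: (B, 3, H, W); audio_feats: (B, Ta, 128); token_ids: (B, Tt)
            v = self.visual(frames)                    # (B, dim, H/16, W/16)
            B, D, h, w = v.shape
            v_seq = v.flatten(2).transpose(1, 2)       # (B, h*w, dim)
            cues = torch.cat([self.audio(audio_feats), self.text(token_ids)], dim=1)
            # Visual tokens attend to the audio + text cues, then a 1x1 conv predicts mask logits.
            fused, _ = self.fuse(v_seq, cues, cues)    # (B, h*w, dim)
            fused = fused.transpose(1, 2).reshape(B, D, h, w)
            logits = self.mask_head(fused)             # (B, 1, H/16, W/16)
            return nn.functional.interpolate(logits, size=frames.shape[-2:], mode="bilinear")

    # Example call with random inputs:
    # model = RefAVSSketch()
    # mask_logits = model(torch.rand(1, 3, 224, 224), torch.rand(1, 50, 128),
    #                     torch.randint(0, 30522, (1, 12)))

A real system would replace these toy encoders with pretrained backbones and train the mask head on the Ref-AVS benchmark's pixel-level annotations.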