Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
July 15, 2024
Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
cs.AI
Abstract
Traditional reference segmentation tasks have predominantly focused on silent
visual scenes, neglecting the integral role of multimodal perception and
interaction in human experiences. In this work, we introduce a novel task
called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment
objects within the visual domain based on expressions containing multimodal
cues. Such expressions are articulated in natural language forms but are
enriched with multimodal cues, including audio and visual descriptions. To
facilitate this research, we construct the first Ref-AVS benchmark, which
provides pixel-level annotations for objects described in corresponding
multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method
that adequately utilizes multimodal cues to offer precise segmentation
guidance. Finally, we conduct quantitative and qualitative experiments on three
test subsets to compare our approach with existing methods from related tasks.
The results demonstrate the effectiveness of our method, highlighting its
capability to precisely segment objects using multimodal-cue expressions.
The dataset is available at https://gewu-lab.github.io/Ref-AVS.
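To make the task setup concrete, the sketch below illustrates what a single Ref-AVS benchmark example and a standard region-overlap metric might look like. The class name `RefAVSSample`, its field layout, the example expression, and the use of frame-wise IoU are illustrative assumptions for exposition only; they are not the paper's actual data format or evaluation protocol.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RefAVSSample:
    """Hypothetical Ref-AVS example: a video clip with its audio track,
    a natural-language expression carrying multimodal cues, and
    pixel-level masks for the referred object (assumed layout)."""
    frames: np.ndarray      # (T, H, W, 3) RGB video frames
    audio: np.ndarray       # (N,) mono waveform aligned with the frames
    expression: str         # e.g. "the instrument playing louder than the piano" (hypothetical cue)
    gt_masks: np.ndarray    # (T, H, W) binary masks of the referred object


def mask_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Intersection-over-union between predicted and ground-truth masks,
    a common segmentation metric (assumed here, not confirmed by the abstract)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))
```

A model addressing the task would consume `frames`, `audio`, and `expression` and predict per-frame masks, which could then be scored against `gt_masks` with a metric such as `mask_iou`.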