Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
July 15, 2024
Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu
cs.AI
Abstract
Traditional reference segmentation tasks have predominantly focused on silent
visual scenes, neglecting the integral role of multimodal perception and
interaction in human experiences. In this work, we introduce a novel task
called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment
objects within the visual domain based on expressions containing multimodal
cues. Such expressions are articulated in natural language forms but are
enriched with multimodal cues, including audio and visual descriptions. To
facilitate this research, we construct the first Ref-AVS benchmark, which
provides pixel-level annotations for objects described in corresponding
multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method
that adequately utilizes multimodal cues to offer precise segmentation
guidance. Finally, we conduct quantitative and qualitative experiments on three
test subsets to compare our approach with existing methods from related tasks.
The results demonstrate the effectiveness of our method, highlighting its
capability to precisely segment objects using multimodal-cue expressions.
The dataset is available at https://gewu-lab.github.io/Ref-AVS.
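To make the task setup concrete, the sketch below illustrates what a single Ref-AVS benchmark example and a standard region-overlap metric might look like. The class name `RefAVSSample`, its field layout, the example expression, and the use of frame-wise IoU are illustrative assumptions for exposition only; they are not the paper's actual data format or evaluation protocol.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RefAVSSample:
    """Hypothetical Ref-AVS example: a video clip with its audio track,
    a natural-language expression carrying multimodal cues, and
    pixel-level masks for the referred object (assumed layout)."""
    frames: np.ndarray      # (T, H, W, 3) RGB video frames
    audio: np.ndarray       # (N,) mono waveform aligned with the frames
    expression: str         # e.g. "the instrument playing louder than the piano" (hypothetical cue)
    gt_masks: np.ndarray    # (T, H, W) binary masks of the referred object


def mask_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Intersection-over-union between predicted and ground-truth masks,
    a common segmentation metric (assumed here, not confirmed by the abstract)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + eps))
```

A model addressing the task would consume `frames`, `audio`, and `expression` and predict per-frame masks, which could then be scored against `gt_masks` with a metric such as `mask_iou`.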